F.48. unaccent — a text search dictionary which removes diacritics

   F.48.1. Configuration
   F.48.2. Usage
   F.48.3. Functions

   unaccent is a text search dictionary that removes accents (diacritic
   signs) from lexemes. It's a filtering dictionary, which means its
   output is always passed to the next dictionary (if any), unlike the
   normal behavior of dictionaries. This allows accent-insensitive
   processing for full text search.

   The current implementation of unaccent cannot be used as a normalizing
   dictionary for the thesaurus dictionary.

   This module is considered “trusted”, that is, it can be installed by
   non-superusers who have CREATE privilege on the current database.
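
   As a minimal sketch of the installation step described above, the
   following statements install the module in the current database and
   then check the result. The catalog query is added here only for
   illustration; it is one of several ways to confirm the installation.

```sql
-- Install the extension in the current database. Because the module is
-- trusted, CREATE privilege on the database is sufficient.
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Confirm that the extension is present and see which version was installed.
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'unaccent';
```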

F.48.1. Configuration

   An unaccent dictionary accepts the following options:
     * RULES is the base name of the file containing the list of
       translation rules. This file must be stored in
       $SHAREDIR/tsearch_data/ (where $SHAREDIR means the PostgreSQL
       installation's shared-data directory). Its name must end in .rules
       (which is not to be included in the RULES parameter).

   The rules file has the following format:
     * Each line represents one translation rule, consisting of a
       character with accent followed by a character without accent. The
       first is translated into the second. For example,
À        A
Á        A
Â        A
Ã        A
Ä        A
Å        A
Æ        AE

       The two characters must be separated by whitespace, and any leading
       or trailing whitespace on a line is ignored.
     * Alternatively, if only one character is given on a line, instances
       of that character are deleted; this is useful in languages where
       accents are represented by separate characters.
     * Actually, each “character” can be any string not containing
       whitespace, so unaccent dictionaries could be used for other sorts
       of substring substitutions besides diacritic removal.
     * Some characters, like numeric symbols, may require whitespace in
       their translation rule. In this case, the translated characters can
       be enclosed in double quotes. To include a double quote in the
       translated characters, escape it with a second double quote. For
       example:
¼      " 1/4"
½      " 1/2"
¾      " 3/4"
“       """"
”       """"

     * As with other PostgreSQL text search configuration files, the rules
       file must be stored in UTF-8 encoding. The data is automatically
       translated into the current database's encoding when loaded. Any
       lines containing untranslatable characters are silently ignored, so
       that rules files can contain rules that are not applicable in the
       current encoding.

   A more complete example, which is directly useful for most European
   languages, can be found in unaccent.rules, which is installed in
   $SHAREDIR/tsearch_data/ when the unaccent module is installed. This
   rules file translates characters with accents to the same characters
   without accents, and it also expands ligatures into the equivalent
   series of simple characters (for example, Æ to AE).

F.48.2. Usage

   Installing the unaccent extension creates a text search template
   unaccent and a dictionary unaccent based on it. The unaccent dictionary
   has the default parameter setting RULES='unaccent', which makes it
   immediately usable with the standard unaccent.rules file. If you wish,
   you can alter the parameter, for example
mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');

   or create new dictionaries based on the template.
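
   Creating a separate dictionary from the template might look like the
   sketch below. The dictionary name my_unaccent is hypothetical, and the
   example assumes that a file named my_rules.rules already exists in
   $SHAREDIR/tsearch_data/.

```sql
-- Hypothetical dictionary name; assumes $SHAREDIR/tsearch_data/my_rules.rules
-- exists and follows the rules file format described in this section.
CREATE TEXT SEARCH DICTIONARY my_unaccent (
    TEMPLATE = unaccent,
    RULES = 'my_rules'
);

-- Exercise the new dictionary directly; the result depends on the
-- contents of my_rules.rules.
SELECT ts_lexize('my_unaccent', 'Hôtel');
```

   Compared with altering the unaccent dictionary in place, this approach
   leaves the default dictionary untouched, so anything that relies on
   the standard unaccent.rules file keeps its current behavior.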

   To test the dictionary, you can try:
mydb=# select ts_lexize('unaccent','Hôtel');
 ts_lexize
-----------
 {Hotel}
(1 row)

   Here is an example showing how to insert the unaccent dictionary into a
   text search configuration:
mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
mydb=# ALTER TEXT SEARCH CONFIGURATION fr
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, french_stem;
mydb=# select to_tsvector('fr','Hôtels de la Mer');
    to_tsvector
-------------------
 'hotel':1 'mer':4
(1 row)

mydb=# select to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels');
 ?column?
----------
 t
(1 row)

mydb=# select ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));
      ts_headline
------------------------
 <b>Hôtel</b> de la Mer
(1 row)
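
   A configuration such as fr can also back a full text index. The
   following is a minimal sketch that assumes the fr configuration from
   the example above exists; the table places and the index name are
   hypothetical and serve only as an illustration.

```sql
-- Hypothetical table, used only for illustration.
CREATE TABLE places (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL
);

-- Expression index built with the accent-insensitive configuration.
CREATE INDEX places_name_fts_idx
    ON places
    USING gin (to_tsvector('fr', name));

-- The query repeats the indexed expression so that the planner can
-- consider the index.
SELECT id, name
FROM places
WHERE to_tsvector('fr', name) @@ to_tsquery('fr', 'Hotels');
```

   If the rules behind the dictionary are changed later, existing index
   entries are not rewritten automatically, so the index would likely
   need to be rebuilt with REINDEX.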

F.48.3. Functions

   The unaccent() function removes accents (diacritic signs) from a given
   string. Basically, it's a wrapper around unaccent-type dictionaries,
   but it can be used outside normal text search contexts.
unaccent([dictionary regdictionary, ] string text) returns text

   If the dictionary argument is omitted, the text search dictionary named
   unaccent and appearing in the same schema as the unaccent() function
   itself is used.

   For example:
SELECT unaccent('unaccent', 'Hôtel');
SELECT unaccent('Hôtel');
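
   Outside of text search, the function can be used for accent-insensitive
   comparisons of ordinary strings. The sketch below runs against an
   inline list of made-up values rather than a real table, and combines
   unaccent() with lower() so that the comparison also ignores case.

```sql
-- Made-up sample values, supplied inline so the query is self-contained.
SELECT name
FROM (VALUES ('Hôtel de la Mer'),
             ('Hotel de la Mer'),
             ('Motel')) AS t(name)
WHERE lower(unaccent(name)) = lower(unaccent('hôtel de la mer'));
```

   With the standard unaccent.rules file, the first two values are
   expected to match, because both reduce to the same unaccented,
   lower-case string.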