12.6. Dictionaries
   3
   4    12.6.1. Stop Words
   5    12.6.2. Simple Dictionary
   6    12.6.3. Synonym Dictionary
   7    12.6.4. Thesaurus Dictionary
   8    12.6.5. Ispell Dictionary
   9    12.6.6. Snowball Dictionary
  10
  11    Dictionaries are used to eliminate words that should not be considered
  12    in a search (stop words), and to normalize words so that different
  13    derived forms of the same word will match. A successfully normalized
  14    word is called a lexeme. Aside from improving search quality,
  15    normalization and removal of stop words reduce the size of the tsvector
  16    representation of a document, thereby improving performance.
  17    Normalization does not always have linguistic meaning and usually
  18    depends on application semantics.
  19
  20    Some examples of normalization:
  21      * Linguistic — Ispell dictionaries try to reduce input words to a
  22        normalized form; stemmer dictionaries remove word endings
  23      * URL locations can be canonicalized to make equivalent URLs match:
  24           + http://www.pgsql.ru/db/mw/index.html
  25           + http://www.pgsql.ru/db/mw/
  26           + http://www.pgsql.ru/db/../db/mw/index.html
  27      * Color names can be replaced by their hexadecimal values, e.g., red,
  28        green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF
  29      * If indexing numbers, we can remove some fractional digits to reduce
  30        the range of possible numbers, so for example 3.14159265359,
  31        3.1415926, 3.14 will be the same after normalization if only two
  32        digits are kept after the decimal point.
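
   The last item can be tried out with plain SQL. This is only a sketch of
   the idea, using the built-in trunc function rather than a text search
   dictionary; a dictionary applying such a rule would do the equivalent
   to number tokens:

```sql
-- Keep only two digits after the decimal point (illustration only).
SELECT trunc(3.14159265359, 2) AS a,
       trunc(3.1415926, 2)     AS b,
       trunc(3.14, 2)          AS c;
-- All three columns are 3.14.
```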
  33
  34    A dictionary is a program that accepts a token as input and returns:
  35      * an array of lexemes if the input token is known to the dictionary
  36        (notice that one token can produce more than one lexeme)
  37      * a single lexeme with the TSL_FILTER flag set, to replace the
  38        original token with a new token to be passed to subsequent
  39        dictionaries (a dictionary that does this is called a filtering
  40        dictionary)
  41      * an empty array if the dictionary knows the token, but it is a stop
  42        word
  43      * NULL if the dictionary does not recognize the input token
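
   Three of these outcomes can be observed with ts_lexize. The following
   is a minimal sketch; demo_stop_only is a hypothetical dictionary name
   introduced here for illustration, built from the simple template and
   the english stop word file shipped with PostgreSQL:

```sql
-- Known word: an array of lexemes.
SELECT ts_lexize('english_stem', 'stars');             -- {star}

-- Known stop word: an empty array.
SELECT ts_lexize('english_stem', 'a');                 -- {}

-- Unrecognized token: NULL. demo_stop_only (hypothetical) knows only
-- the stop words in english.stop and reports everything else as unknown.
CREATE TEXT SEARCH DICTIONARY demo_stop_only (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english,
    ACCEPT = false
);
SELECT ts_lexize('demo_stop_only', 'stars') IS NULL;   -- t
```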
  44
  45    PostgreSQL provides predefined dictionaries for many languages. There
  46    are also several predefined templates that can be used to create new
  47    dictionaries with custom parameters. Each predefined dictionary
  48    template is described below. If no existing template is suitable, it is
  49    possible to create new ones; see the contrib/ area of the PostgreSQL
  50    distribution for examples.
  51
  52    A text search configuration binds a parser together with a set of
  53    dictionaries to process the parser's output tokens. For each token type
  54    that the parser can return, a separate list of dictionaries is
  55    specified by the configuration. When a token of that type is found by
  56    the parser, each dictionary in the list is consulted in turn, until
  57    some dictionary recognizes it as a known word. If it is identified as a
  58    stop word, or if no dictionary recognizes the token, it will be
  59    discarded and not indexed or searched for. Normally, the first
  60    dictionary that returns a non-NULL output determines the result, and
  61    any remaining dictionaries are not consulted; but a filtering
  62    dictionary can replace the given word with a modified word, which is
  63    then passed to subsequent dictionaries.
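
   One way to watch this lookup is the ts_debug function, which reports,
   for each token, the dictionaries listed for its token type and the one
   that recognized it. A small sketch against the built-in english
   configuration:

```sql
-- For each token: its type, the configured dictionary list, the
-- dictionary that recognized it, and the lexemes produced.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('english', 'The stars');
```

   In the built-in english configuration, asciiword tokens are mapped to
   english_stem only, so The should be reported with an empty lexeme
   array (a stop word) and stars with {star}.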
  64
  65    The general rule for configuring a list of dictionaries is to place
  66    first the most narrow, most specific dictionary, then the more general
  67    dictionaries, finishing with a very general dictionary, like a Snowball
  68    stemmer or simple, which recognizes everything. For example, for an
  69    astronomy-specific search (astro_en configuration) one could bind token
  70    type asciiword (ASCII word) to a synonym dictionary of astronomical
  71    terms, a general English dictionary and a Snowball English stemmer:
  72 ALTER TEXT SEARCH CONFIGURATION astro_en
  73     ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
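
   The command above presumes that astro_en and its dictionaries already
   exist. The following sketch shows one way the configuration might be
   set up; it assumes astrosyn and english_ispell were created beforehand
   with CREATE TEXT SEARCH DICTIONARY (only english_stem is built in):

```sql
-- Assumption: astrosyn and english_ispell already exist.

-- A configuration created from a parser alone starts with no mappings,
-- so ADD MAPPING can be used for asciiword.
CREATE TEXT SEARCH CONFIGURATION astro_en (PARSER = pg_catalog."default");

ALTER TEXT SEARCH CONFIGURATION astro_en
    ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
```

   If astro_en were instead created with COPY = english, asciiword would
   already be mapped, and ALTER MAPPING FOR would be needed in place of
   ADD MAPPING FOR.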
  74
  75    A filtering dictionary can be placed anywhere in the list, except at
  76    the end where it'd be useless. Filtering dictionaries are useful to
  77    partially normalize words to simplify the task of later dictionaries.
  78    For example, a filtering dictionary could be used to remove accents
  79    from accented letters, as is done by the unaccent module.
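
   As a sketch of such a setup, the following places unaccent ahead of a
   French stemmer. It assumes the unaccent contrib module is installed;
   fr is just an illustrative configuration name:

```sql
-- Requires the unaccent contrib module to be installed.
CREATE EXTENSION IF NOT EXISTS unaccent;

-- fr is an illustrative configuration name.
CREATE TEXT SEARCH CONFIGURATION fr (COPY = french);

-- unaccent filters each word, then french_stem normalizes the result.
ALTER TEXT SEARCH CONFIGURATION fr
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, french_stem;

SELECT to_tsvector('fr', 'Hôtels de la Mer');   -- 'hotel':1 'mer':4
```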
  80
12.6.1. Stop Words
  82
  83    Stop words are words that are very common, appear in almost every
  84    document, and have no discrimination value. Therefore, they can be
  85    ignored in the context of full text searching. For example, every
  86    English text contains words like a and the, so it is useless to store
  87    them in an index. However, stop words do affect the positions in
  88    tsvector, which in turn affect ranking:
  89 SELECT to_tsvector('english', 'in the list of stop words');
  90         to_tsvector
  91 ----------------------------
  92  'list':3 'stop':5 'word':6
  93
   Positions 1, 2, and 4 are missing because the stop words that occupied
   them were discarded. Ranks calculated for documents with and without
   stop words are quite different:
SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list & stop'));
  98  ts_rank_cd
  99 ------------
 100        0.05
 101
SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list & stop'));
 104  ts_rank_cd
 105 ------------
 106         0.1
 107
 108    It is up to the specific dictionary how it treats stop words. For
 109    example, ispell dictionaries first normalize words and then look at the
 110    list of stop words, while Snowball stemmers first check the list of
 111    stop words. The reason for the different behavior is an attempt to
 112    decrease noise.
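
   To see the effect of a stop word list on positions, the same sentence
   can be run through a configuration that has no stop words. This sketch
   relies on the built-in simple configuration, whose dictionary has no
   stop word file:

```sql
-- The built-in simple configuration has no stop word list.
SELECT to_tsvector('simple', 'in the list of stop words');
-- 'in':1 'list':3 'of':4 'stop':5 'the':2 'words':6

-- The english configuration drops the stop words but keeps the
-- positions of the remaining words.
SELECT to_tsvector('english', 'in the list of stop words');
-- 'list':3 'stop':5 'word':6
```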
 113
12.6.2. Simple Dictionary
 115
 116    The simple dictionary template operates by converting the input token
 117    to lower case and checking it against a file of stop words. If it is
 118    found in the file then an empty array is returned, causing the token to
 119    be discarded. If not, the lower-cased form of the word is returned as
 120    the normalized lexeme. Alternatively, the dictionary can be configured
 121    to report non-stop-words as unrecognized, allowing them to be passed on
 122    to the next dictionary in the list.
 123
 124    Here is an example of a dictionary definition using the simple
 125    template:
 126 CREATE TEXT SEARCH DICTIONARY public.simple_dict (
 127     TEMPLATE = pg_catalog.simple,
 128     STOPWORDS = english
 129 );
 130
 131    Here, english is the base name of a file of stop words. The file's full
 132    name will be $SHAREDIR/tsearch_data/english.stop, where $SHAREDIR means
 133    the PostgreSQL installation's shared-data directory, often
 134    /usr/local/share/postgresql (use pg_config --sharedir to determine it
 135    if you're not sure). The file format is simply a list of words, one per
 136    line. Blank lines and trailing spaces are ignored, and upper case is
 137    folded to lower case, but no other processing is done on the file
 138    contents.
 139
 140    Now we can test our dictionary:
 141 SELECT ts_lexize('public.simple_dict', 'YeS');
 142  ts_lexize
 143 -----------
 144  {yes}
 145
 146 SELECT ts_lexize('public.simple_dict', 'The');
 147  ts_lexize
 148 -----------
 149  {}
 150
 151    We can also choose to return NULL, instead of the lower-cased word, if
 152    it is not found in the stop words file. This behavior is selected by
 153    setting the dictionary's Accept parameter to false. Continuing the
 154    example:
 155 ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );
 156
 157 SELECT ts_lexize('public.simple_dict', 'YeS');
 158  ts_lexize
 159 -----------
 160
 161
 162 SELECT ts_lexize('public.simple_dict', 'The');
 163  ts_lexize
 164 -----------
 165  {}
 166
 167    With the default setting of Accept = true, it is only useful to place a
 168    simple dictionary at the end of a list of dictionaries, since it will
 169    never pass on any token to a following dictionary. Conversely, Accept =
 170    false is only useful when there is at least one following dictionary.
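
   A sketch of the second arrangement follows. The names demo_stop_pass
   and demo_cfg are hypothetical; the dictionary discards stop words and
   hands every other token on to english_stem:

```sql
-- demo_stop_pass and demo_cfg are hypothetical names.
CREATE TEXT SEARCH DICTIONARY public.demo_stop_pass (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english,
    ACCEPT = false
);

CREATE TEXT SEARCH CONFIGURATION public.demo_cfg (COPY = pg_catalog.english);

ALTER TEXT SEARCH CONFIGURATION public.demo_cfg
    ALTER MAPPING FOR asciiword
    WITH public.demo_stop_pass, pg_catalog.english_stem;

-- "The" is discarded by demo_stop_pass; "Stars" is not recognized
-- there, so it is passed on to english_stem.
SELECT to_tsvector('public.demo_cfg', 'The Stars');   -- 'star':2
```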
 171
 172 Caution
 173
 174    Most types of dictionaries rely on configuration files, such as files
 175    of stop words. These files must be stored in UTF-8 encoding. They will
 176    be translated to the actual database encoding, if that is different,
 177    when they are read into the server.
 178
 179 Caution
 180
 181    Normally, a database session will read a dictionary configuration file
 182    only once, when it is first used within the session. If you modify a
 183    configuration file and want to force existing sessions to pick up the
 184    new contents, issue an ALTER TEXT SEARCH DICTIONARY command on the
 185    dictionary. This can be a “dummy” update that doesn't actually change
 186    any parameter values.
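
   A minimal sketch of such a dummy update, applied to the simple_dict
   dictionary from the example above. Here dummy is not a real option
   name; writing an option without a value removes it, and removing an
   option that does not exist is not an error:

```sql
-- "Dummy" update: no parameter value actually changes, but sessions
-- will re-read the dictionary's configuration files.
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( dummy );
```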
 187
12.6.3. Synonym Dictionary
 189
 190    This dictionary template is used to create dictionaries that replace a
 191    word with a synonym. Phrases are not supported (use the thesaurus
 192    template (Section 12.6.4) for that). A synonym dictionary can be used
 193    to overcome linguistic problems, for example, to prevent an English
 194    stemmer dictionary from reducing the word “Paris” to “pari”. It is
 195    enough to have a Paris paris line in the synonym dictionary and put it
 196    before the english_stem dictionary. For example:
 197 SELECT * FROM ts_debug('english', 'Paris');
 198    alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
 199 -----------+-----------------+-------+----------------+--------------+---------
 200  asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari}
 201
 202 CREATE TEXT SEARCH DICTIONARY my_synonym (
 203     TEMPLATE = synonym,
 204     SYNONYMS = my_synonyms
 205 );
 206
 207 ALTER TEXT SEARCH CONFIGURATION english
 208     ALTER MAPPING FOR asciiword
 209     WITH my_synonym, english_stem;
 210
 211 SELECT * FROM ts_debug('english', 'Paris');
   alias   |   description   | token |       dictionaries        | dictionary | lexemes
-----------+-----------------+-------+---------------------------+------------+---------
 asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
 218
 219    The only parameter required by the synonym template is SYNONYMS, which
 220    is the base name of its configuration file — my_synonyms in the above
 221    example. The file's full name will be
 222    $SHAREDIR/tsearch_data/my_synonyms.syn (where $SHAREDIR means the
 223    PostgreSQL installation's shared-data directory). The file format is
 224    just one line per word to be substituted, with the word followed by its
 225    synonym, separated by white space. Blank lines and trailing spaces are
 226    ignored.
 227
 228    The synonym template also has an optional parameter CaseSensitive,
 229    which defaults to false. When CaseSensitive is false, words in the
 230    synonym file are folded to lower case, as are input tokens. When it is
 231    true, words and tokens are not folded to lower case, but are compared
 232    as-is.
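
   A sketch of a case-sensitive variant, assuming the file
   $SHAREDIR/tsearch_data/my_synonyms.syn contains the Paris paris line
   mentioned earlier; my_synonym_cs is a hypothetical dictionary name:

```sql
-- Assumes my_synonyms.syn contains the line:
--   Paris paris
-- my_synonym_cs is a hypothetical dictionary name.
CREATE TEXT SEARCH DICTIONARY my_synonym_cs (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms,
    CaseSensitive = true
);

SELECT ts_lexize('my_synonym_cs', 'Paris');   -- spelled as in the file
SELECT ts_lexize('my_synonym_cs', 'paris');   -- no entry with this spelling
```

   With CaseSensitive = true, only the first call is expected to find the
   entry, because the token paris differs in case from the word Paris in
   the file.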
 233
 234    An asterisk (*) can be placed at the end of a synonym in the
 235    configuration file. This indicates that the synonym is a prefix. The
 236    asterisk is ignored when the entry is used in to_tsvector(), but when
 237    it is used in to_tsquery(), the result will be a query item with the
 238    prefix match marker (see Section 12.3.2). For example, suppose we have
 239    these entries in $SHAREDIR/tsearch_data/synonym_sample.syn:
 240 postgres        pgsql
 241 postgresql      pgsql
 242 postgre pgsql
 243 gogle   googl
 244 indices index*
 245
 246    Then we will get these results:
mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
 249 mydb=# SELECT ts_lexize('syn', 'indices');
 250  ts_lexize
 251 -----------
 252  {index}
 253 (1 row)
 254
 255 mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
 256 mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
 257 mydb=# SELECT to_tsvector('tst', 'indices');
 258  to_tsvector
 259 -------------
 260  'index':1
 261 (1 row)
 262
 263 mydb=# SELECT to_tsquery('tst', 'indices');
 264  to_tsquery
 265 ------------
 266  'index':*
 267 (1 row)
 268
 269 mydb=# SELECT 'indexes are very useful'::tsvector;
 270             tsvector
 271 ---------------------------------
 272  'are' 'indexes' 'useful' 'very'
 273 (1 row)
 274
mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst', 'indices');
 277  ?column?
 278 ----------
 279  t
 280 (1 row)
 281
12.6.4. Thesaurus Dictionary
 283
 284    A thesaurus dictionary (sometimes abbreviated as TZ) is a collection of
 285    words that includes information about the relationships of words and
 286    phrases, i.e., broader terms (BT), narrower terms (NT), preferred
 287    terms, non-preferred terms, related terms, etc.
 288
 289    Basically a thesaurus dictionary replaces all non-preferred terms by
 290    one preferred term and, optionally, preserves the original terms for
 291    indexing as well. PostgreSQL's current implementation of the thesaurus
 292    dictionary is an extension of the synonym dictionary with added phrase
 293    support. A thesaurus dictionary requires a configuration file of the
 294    following format:
 295 # this is a comment
 296 sample word(s) : indexed word(s)
 297 more sample word(s) : more indexed word(s)
 298 ...
 299
 300    where the colon (:) symbol acts as a delimiter between a phrase and its
 301    replacement.
 302
 303    A thesaurus dictionary uses a subdictionary (which is specified in the
 304    dictionary's configuration) to normalize the input text before checking
 305    for phrase matches. It is only possible to select one subdictionary. An
 306    error is reported if the subdictionary fails to recognize a word. In
 307    that case, you should remove the use of the word or teach the
 308    subdictionary about it. You can place an asterisk (*) at the beginning
 309    of an indexed word to skip applying the subdictionary to it, but all
 310    sample words must be known to the subdictionary.
 311
 312    The thesaurus dictionary chooses the longest match if there are
 313    multiple phrases matching the input, and ties are broken by using the
 314    last definition.
 315
 316    Specific stop words recognized by the subdictionary cannot be
 317    specified; instead use ? to mark the location where any stop word can
 318    appear. For example, assuming that a and the are stop words according
 319    to the subdictionary:
 320 ? one ? two : swsw
 321
 322    matches a one the two and the one a two; both would be replaced by
 323    swsw.
 324
   Since a thesaurus dictionary has the capability to recognize phrases,
   it must remember its state and interact with the parser. It uses the
   token-type assignments in the text search configuration to decide
   whether it should handle the next word or stop accumulating a phrase.
   The thesaurus dictionary must therefore be configured carefully. For
   example, if the thesaurus dictionary is assigned to handle only the
   asciiword token type, then a thesaurus definition like one 7 will not
   work, since token type uint is not assigned to the thesaurus
   dictionary.
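
   A sketch of a mapping that would let such a definition match. The
   names my_cfg and my_thesaurus are hypothetical, and my_cfg is assumed
   to have been copied from english so that both token types already
   have mappings:

```sql
-- my_cfg and my_thesaurus are hypothetical names; my_thesaurus is
-- assumed to contain a definition whose sample phrase is "one 7".
ALTER TEXT SEARCH CONFIGURATION my_cfg
    ALTER MAPPING FOR asciiword
    WITH my_thesaurus, english_stem;

-- Without this mapping the token 7 (type uint) never reaches the
-- thesaurus, so the phrase cannot match.
ALTER TEXT SEARCH CONFIGURATION my_cfg
    ALTER MAPPING FOR uint
    WITH my_thesaurus, simple;
```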
 333
 334 Caution
 335
   Thesauruses are used during indexing, so any change in the thesaurus
   dictionary's parameters requires reindexing. For most other dictionary
   types, small changes such as adding or removing stop words do not
   force reindexing.
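
   What reindexing involves depends on how the documents are indexed. The
   following sketch covers two common layouts; the table, column, and
   index names (docs, body, tsv, docs_body_fts_idx) are hypothetical:

```sql
-- docs, body, tsv, and docs_body_fts_idx are hypothetical names.

-- Stored tsvector column: recompute it with the updated dictionary.
UPDATE docs SET tsv = to_tsvector('english', body);

-- Expression index on to_tsvector(...): rebuild the index.
REINDEX INDEX docs_body_fts_idx;
```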
 340
12.6.4.1. Thesaurus Configuration
 342
 343    To define a new thesaurus dictionary, use the thesaurus template. For
 344    example:
 345 CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
 346     TEMPLATE = thesaurus,
 347     DictFile = mythesaurus,
 348     Dictionary = pg_catalog.english_stem
 349 );
 350
 351    Here:
 352      * thesaurus_simple is the new dictionary's name
 353      * mythesaurus is the base name of the thesaurus configuration file.
 354        (Its full name will be $SHAREDIR/tsearch_data/mythesaurus.ths,
 355        where $SHAREDIR means the installation shared-data directory.)
 356      * pg_catalog.english_stem is the subdictionary (here, a Snowball
 357        English stemmer) to use for thesaurus normalization. Notice that
 358        the subdictionary will have its own configuration (for example,
 359        stop words), which is not shown here.
 360
 361    Now it is possible to bind the thesaurus dictionary thesaurus_simple to
 362    the desired token types in a configuration, for example:
 363 ALTER TEXT SEARCH CONFIGURATION russian
 364     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
 365     WITH thesaurus_simple;
 366
12.6.4.2. Thesaurus Example
 368
 369    Consider a simple astronomical thesaurus thesaurus_astro, which
 370    contains some astronomical word combinations:
 371 supernovae stars : sn
 372 crab nebulae : crab
 373
 374    Below we create a dictionary and bind some token types to an
 375    astronomical thesaurus and English stemmer:
 376 CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
 377     TEMPLATE = thesaurus,
 378     DictFile = thesaurus_astro,
 379     Dictionary = english_stem
 380 );
 381
 382 ALTER TEXT SEARCH CONFIGURATION russian
 383     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
 384     WITH thesaurus_astro, english_stem;
 385
 386    Now we can see how it works. ts_lexize is not very useful for testing a
 387    thesaurus, because it treats its input as a single token. Instead we
 388    can use plainto_tsquery and to_tsvector which will break their input
 389    strings into multiple tokens:
 390 SELECT plainto_tsquery('supernova star');
 391  plainto_tsquery
 392 -----------------
 393  'sn'
 394
 395 SELECT to_tsvector('supernova star');
 396  to_tsvector
 397 -------------
 398  'sn':1
 399
   In principle, you can also use to_tsquery if you quote the argument:
 401 SELECT to_tsquery('''supernova star''');
 402  to_tsquery
 403 ------------
 404  'sn'
 405
 406    Notice that supernova star matches supernovae stars in thesaurus_astro
 407    because we specified the english_stem stemmer in the thesaurus
 408    definition. The stemmer removed the e and s.
 409
 410    To index the original phrase as well as the substitute, just include it
 411    in the right-hand part of the definition:
 412 supernovae stars : sn supernovae stars
 413
 414 SELECT plainto_tsquery('supernova star');
 415        plainto_tsquery
 416 -----------------------------
 417  'sn' & 'supernova' & 'star'
 418
12.6.5. Ispell Dictionary
 420
 421    The Ispell dictionary template supports morphological dictionaries,
 422    which can normalize many different linguistic forms of a word into the
 423    same lexeme. For example, an English Ispell dictionary can match all
 424    declensions and conjugations of the search term bank, e.g., banking,
 425    banked, banks, banks', and bank's.
 426
 427    The standard PostgreSQL distribution does not include any Ispell
 428    configuration files. Dictionaries for a large number of languages are
 429    available from Ispell. Also, some more modern dictionary file formats
 430    are supported — MySpell (OO < 2.0.1) and Hunspell (OO >= 2.0.2). A
 431    large list of dictionaries is available on the OpenOffice Wiki.
 432
   To create an Ispell dictionary, perform these steps:
     * download the dictionary configuration files. OpenOffice extension
       files have the .oxt extension. Extract the .aff and .dic files and
       change their extensions to .affix and .dict. Some dictionary files
       must also be converted to the UTF-8 encoding, for example, for a
       Norwegian language dictionary:
 439 iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
 440 iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
 441
 442      * copy files to the $SHAREDIR/tsearch_data directory
 443      * load files into PostgreSQL with the following command:
 444 CREATE TEXT SEARCH DICTIONARY english_hunspell (
 445     TEMPLATE = ispell,
 446     DictFile = en_us,
 447     AffFile = en_us,
 448     Stopwords = english);
 449
 450    Here, DictFile, AffFile, and StopWords specify the base names of the
 451    dictionary, affixes, and stop-words files. The stop-words file has the
 452    same format explained above for the simple dictionary type. The format
 453    of the other files is not specified here but is available from the
 454    above-mentioned web sites.
 455
 456    Ispell dictionaries usually recognize a limited set of words, so they
 457    should be followed by another broader dictionary; for example, a
 458    Snowball dictionary, which recognizes everything.
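
   A sketch of that arrangement, assuming english_hunspell was created as
   shown above and its files are present in $SHAREDIR/tsearch_data;
   my_english is a hypothetical configuration name:

```sql
-- Assumes english_hunspell exists; my_english is a hypothetical name.
CREATE TEXT SEARCH CONFIGURATION my_english (COPY = pg_catalog.english);

-- Words unknown to the Ispell dictionary fall through to the stemmer.
ALTER TEXT SEARCH CONFIGURATION my_english
    ALTER MAPPING FOR asciiword
    WITH english_hunspell, english_stem;

-- Inspect which dictionary handles each token.
SELECT token, dictionary, lexemes
FROM ts_debug('my_english', 'banking supernovae');
```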
 459
 460    The .affix file of Ispell has the following structure:
 461 prefixes
 462 flag *A:
 463     .           >   RE      # As in enter > reenter
 464 suffixes
 465 flag T:
 466     E           >   ST      # As in late > latest
 467     [^AEIOU]Y   >   -Y,IEST # As in dirty > dirtiest
 468     [AEIOU]Y    >   EST     # As in gray > grayest
 469     [^EY]       >   EST     # As in small > smallest
 470
 471    And the .dict file has the following structure:
 472 lapse/ADGRS
 473 lard/DGRS
 474 large/PRTY
 475 lark/MRS
 476
   The format of the .dict file is:
 478 basic_form/affix_class_name
 479
 480    In the .affix file every affix flag is described in the following
 481    format:
 482 condition > [-stripping_letters,] adding_affix
 483
 484    Here, condition has a format similar to the format of regular
 485    expressions. It can use groupings [...] and [^...]. For example,
 486    [AEIOU]Y means that the last letter of the word is "y" and the
 487    penultimate letter is "a", "e", "i", "o" or "u". [^EY] means that the
 488    last letter is neither "e" nor "y".
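
   To make the rule format concrete, the four rules of flag T shown
   earlier can be imitated with ordinary SQL pattern matching. This is
   only an illustration of how the conditions select a rule, not how the
   dictionary is implemented; an Ispell dictionary uses the rules in the
   opposite direction, to reduce an inflected form to its basic form:

```sql
-- Apply the flag T suffix rules to a few basic forms (illustration only).
SELECT w AS basic_form,
       CASE
           WHEN w ~ 'E$'         THEN w || 'ST'                        -- E > ST
           WHEN w ~ '[^AEIOU]Y$' THEN regexp_replace(w, 'Y$', 'IEST')  -- -Y,IEST
           WHEN w ~ '[AEIOU]Y$'  THEN w || 'EST'                       -- EST
           WHEN w ~ '[^EY]$'     THEN w || 'EST'                       -- EST
       END AS with_flag_t
FROM (VALUES ('LATE'), ('DIRTY'), ('GRAY'), ('SMALL')) AS t(w);
-- LATE -> LATEST, DIRTY -> DIRTIEST, GRAY -> GRAYEST, SMALL -> SMALLEST
```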
 489
   Ispell dictionaries support splitting compound words, a useful feature.
   Notice that the affix file should specify a special flag, using the
   compoundwords controlled statement, that marks dictionary words that
   can participate in compound formation:
 494 compoundwords  controlled z
 495
 496    Here are some examples for the Norwegian language:
 497 SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
 498    {over,buljong,terning,pakk,mester,assistent}
 499 SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
 500    {sjokoladefabrikk,sjokolade,fabrikk}
 501
 502    MySpell format is a subset of Hunspell. The .affix file of Hunspell has
 503    the following structure:
 504 PFX A Y 1
 505 PFX A   0     re         .
 506 SFX T N 4
 507 SFX T   0     st         e
 508 SFX T   y     iest       [^aeiou]y
 509 SFX T   0     est        [aeiou]y
 510 SFX T   0     est        [^ey]
 511
   The first line of an affix class is the header. The fields of each
   affix rule are listed after the header:
 514      * parameter name (PFX or SFX)
 515      * flag (name of the affix class)
 516      * stripping characters from beginning (at prefix) or end (at suffix)
 517        of the word
 518      * adding affix
 519      * condition that has a format similar to the format of regular
 520        expressions.
 521
 522    The .dict file looks like the .dict file of Ispell:
 523 larder/M
 524 lardy/RT
 525 large/RSPMYT
 526 largehearted
 527
 528 Note
 529
 530    MySpell does not support compound words. Hunspell has sophisticated
 531    support for compound words. At present, PostgreSQL implements only the
 532    basic compound word operations of Hunspell.
 533
12.6.6. Snowball Dictionary
 535
 536    The Snowball dictionary template is based on a project by Martin
 537    Porter, inventor of the popular Porter's stemming algorithm for the
 538    English language. Snowball now provides stemming algorithms for many
 539    languages (see the Snowball site for more information). Each algorithm
 540    understands how to reduce common variant forms of words to a base, or
 541    stem, spelling within its language. A Snowball dictionary requires a
 542    language parameter to identify which stemmer to use, and optionally can
 543    specify a stopword file name that gives a list of words to eliminate.
 544    (PostgreSQL's standard stopword lists are also provided by the Snowball
 545    project.) For example, there is a built-in definition equivalent to
 546 CREATE TEXT SEARCH DICTIONARY english_stem (
 547     TEMPLATE = snowball,
 548     Language = english,
 549     StopWords = english
 550 );
 551
 552    The stopword file format is the same as already explained.
 553
 554    A Snowball dictionary recognizes everything, whether or not it is able
 555    to simplify the word, so it should be placed at the end of the
 556    dictionary list. It is useless to have it before any other dictionary
 557    because a token will never pass through it to the next dictionary.