12.6.1. Stop Words
12.6.2. Simple Dictionary
12.6.3. Synonym Dictionary
12.6.4. Thesaurus Dictionary
12.6.5. Ispell Dictionary
12.6.6. Snowball Dictionary
Dictionaries are used to eliminate words that should not be considered
in a search (stop words), and to normalize words so that different
derived forms of the same word will match. A successfully normalized
word is called a lexeme. Aside from improving search quality,
normalization and removal of stop words reduce the size of the tsvector
representation of a document, thereby improving performance.
Normalization does not always have linguistic meaning and usually
depends on application semantics.
Some examples of normalization:

* Linguistic — Ispell dictionaries try to reduce input words to a
  normalized form; stemmer dictionaries remove word endings
* URL locations can be canonicalized to make equivalent URLs match:
  + http://www.pgsql.ru/db/mw/index.html
  + http://www.pgsql.ru/db/mw/
  + http://www.pgsql.ru/db/../db/mw/index.html
* Color names can be replaced by their hexadecimal values, e.g., red,
  green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF
* If indexing numbers, we can remove some fractional digits to reduce
  the range of possible numbers, so for example 3.14159265359,
  3.1415926, and 3.14 will be the same after normalization if only two
  digits are kept after the decimal point.
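
As a concrete illustration, the built-in english configuration combines
stop-word removal with linguistic normalization; the output below is
shown for illustration:

SELECT to_tsvector('english', 'The cats are running');
   to_tsvector
-----------------
 'cat':2 'run':4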
A dictionary is a program that accepts a token as input and returns:

* an array of lexemes if the input token is known to the dictionary
  (notice that one token can produce more than one lexeme)
* a single lexeme with the TSL_FILTER flag set, to replace the
  original token with a new token to be passed to subsequent
  dictionaries (a dictionary that does this is called a filtering
  dictionary)
* an empty array if the dictionary knows the token, but it is a stop
  word
* NULL if the dictionary does not recognize the input token
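
These behaviors can be observed directly with the ts_lexize function.
For example, the built-in english_stem dictionary returns a lexeme
array for a word it can stem, and an empty array for one of its stop
words (output shown for illustration):

SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}

SELECT ts_lexize('english_stem', 'a');
 ts_lexize
-----------
 {}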
PostgreSQL provides predefined dictionaries for many languages. There
are also several predefined templates that can be used to create new
dictionaries with custom parameters. Each predefined dictionary
template is described below. If no existing template is suitable, it is
possible to create new ones; see the contrib/ area of the PostgreSQL
distribution for examples.
A text search configuration binds a parser together with a set of
dictionaries to process the parser's output tokens. For each token type
that the parser can return, a separate list of dictionaries is
specified by the configuration. When a token of that type is found by
the parser, each dictionary in the list is consulted in turn, until
some dictionary recognizes it as a known word. If it is identified as a
stop word, or if no dictionary recognizes the token, it will be
discarded and not indexed or searched for. Normally, the first
dictionary that returns a non-NULL output determines the result, and
any remaining dictionaries are not consulted; but a filtering
dictionary can replace the given word with a modified word, which is
then passed to subsequent dictionaries.
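
The ts_debug function (see Section 12.8.1) makes this process visible
token by token. For example, with the built-in english configuration, a
stop word yields an empty lexeme array while a normal word is stemmed
(illustrative output):

SELECT token, dictionary, lexemes
FROM ts_debug('english', 'The Stars')
WHERE alias = 'asciiword';
 token |  dictionary  | lexemes
-------+--------------+---------
 The   | english_stem | {}
 Stars | english_stem | {star}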
The general rule for configuring a list of dictionaries is to place
first the most narrow, most specific dictionary, then the more general
dictionaries, finishing with a very general dictionary, like a Snowball
stemmer or simple, which recognizes everything. For example, for an
astronomy-specific search (astro_en configuration) one could bind token
type asciiword (ASCII word) to a synonym dictionary of astronomical
terms, a general English dictionary and a Snowball English stemmer:

ALTER TEXT SEARCH CONFIGURATION astro_en
    ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
A filtering dictionary can be placed anywhere in the list, except at
the end where it'd be useless. Filtering dictionaries are useful to
partially normalize words to simplify the task of later dictionaries.
For example, a filtering dictionary could be used to remove accents
from accented letters, as is done by the unaccent module.
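
For example, following the approach in the unaccent module's
documentation, a French configuration could apply unaccent as a
filtering step ahead of the stemmer (this sketch assumes the unaccent
extension is available to install):

CREATE EXTENSION unaccent;
CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
ALTER TEXT SEARCH CONFIGURATION fr
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, french_stem;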
12.6.1. Stop Words

Stop words are words that are very common, appear in almost every
document, and have no discrimination value. Therefore, they can be
ignored in the context of full text searching. For example, every
English text contains words like a and the, so it is useless to store
them in an index. However, stop words do affect the positions in
tsvector, which in turn affect ranking:

SELECT to_tsvector('english', 'in the list of stop words');
        to_tsvector
----------------------------
 'list':3 'stop':5 'word':6

The missing positions 1,2,4 are because of stop words. Ranks calculated
for documents with and without stop words are quite different:
SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list & stop'));
 ts_rank_cd
------------
       0.05

SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list & stop'));
 ts_rank_cd
------------
        0.1
It is up to the specific dictionary how it treats stop words. For
example, ispell dictionaries first normalize words and then look at the
list of stop words, while Snowball stemmers first check the list of
stop words. The reason for the different behavior is an attempt to
decrease noise.
12.6.2. Simple Dictionary
The simple dictionary template operates by converting the input token
to lower case and checking it against a file of stop words. If it is
found in the file then an empty array is returned, causing the token to
be discarded. If not, the lower-cased form of the word is returned as
the normalized lexeme. Alternatively, the dictionary can be configured
to report non-stop-words as unrecognized, allowing them to be passed on
to the next dictionary in the list.
Here is an example of a dictionary definition using the simple
template:

CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);
Here, english is the base name of a file of stop words. The file's full
name will be $SHAREDIR/tsearch_data/english.stop, where $SHAREDIR means
the PostgreSQL installation's shared-data directory, often
/usr/local/share/postgresql (use pg_config --sharedir to determine it
if you're not sure). The file format is simply a list of words, one per
line. Blank lines and trailing spaces are ignored, and upper case is
folded to lower case, but no other processing is done on the file
contents.
Now we can test our dictionary:

SELECT ts_lexize('public.simple_dict', 'YeS');
 ts_lexize
-----------
 {yes}

SELECT ts_lexize('public.simple_dict', 'The');
 ts_lexize
-----------
 {}
We can also choose to return NULL, instead of the lower-cased word, if
it is not found in the stop words file. This behavior is selected by
setting the dictionary's Accept parameter to false. Continuing the
example:

ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );

SELECT ts_lexize('public.simple_dict', 'YeS');
 ts_lexize
-----------


SELECT ts_lexize('public.simple_dict', 'The');
 ts_lexize
-----------
 {}
With the default setting of Accept = true, it is only useful to place a
simple dictionary at the end of a list of dictionaries, since it will
never pass on any token to a following dictionary. Conversely, Accept =
false is only useful when there is at least one following dictionary.
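
For instance, here is a sketch of the latter arrangement, using the
simple_dict defined above (with Accept = false) in front of a stemmer
in a hypothetical configuration my_config:

ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR asciiword WITH public.simple_dict, english_stem;

Stop words from the file are discarded by simple_dict, while all other
tokens (for which it returns NULL) are passed on to english_stem.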
Note: Most types of dictionaries rely on configuration files, such as
files of stop words. These files must be stored in UTF-8 encoding. They
will be translated to the actual database encoding, if that is
different, when they are read into the server.
Note: Normally, a database session will read a dictionary configuration
file only once, when it is first used within the session. If you modify
a configuration file and want to force existing sessions to pick up the
new contents, issue an ALTER TEXT SEARCH DICTIONARY command on the
dictionary. This can be a “dummy” update that doesn't actually change
any parameter values.
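
For example, re-issuing an existing parameter with its current value is
enough to force a reload of the simple_dict stop-word file from the
earlier example:

ALTER TEXT SEARCH DICTIONARY public.simple_dict ( StopWords = english );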
12.6.3. Synonym Dictionary
This dictionary template is used to create dictionaries that replace a
word with a synonym. Phrases are not supported (use the thesaurus
template (Section 12.6.4) for that). A synonym dictionary can be used
to overcome linguistic problems, for example, to prevent an English
stemmer dictionary from reducing the word “Paris” to “pari”. It is
enough to have a Paris paris line in the synonym dictionary and put it
before the english_stem dictionary. For example:

SELECT * FROM ts_debug('english', 'Paris');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari}
CREATE TEXT SEARCH DICTIONARY my_synonym (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms
);

ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR asciiword
    WITH my_synonym, english_stem;
SELECT * FROM ts_debug('english', 'Paris');
   alias   |   description   | token |       dictionaries        | dictionary | lexemes
-----------+-----------------+-------+---------------------------+------------+---------
 asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
The only parameter required by the synonym template is SYNONYMS, which
is the base name of its configuration file — my_synonyms in the above
example. The file's full name will be
$SHAREDIR/tsearch_data/my_synonyms.syn (where $SHAREDIR means the
PostgreSQL installation's shared-data directory). The file format is
just one line per word to be substituted, with the word followed by its
synonym, separated by white space. Blank lines and trailing spaces are
ignored.
The synonym template also has an optional parameter CaseSensitive,
which defaults to false. When CaseSensitive is false, words in the
synonym file are folded to lower case, as are input tokens. When it is
true, words and tokens are not folded to lower case, but are compared
as-is.
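
For example, a case-sensitive variant of the earlier dictionary might
be defined like this (my_synonym_cs is a hypothetical name; the file
format is unchanged):

CREATE TEXT SEARCH DICTIONARY my_synonym_cs (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms,
    CaseSensitive = true
);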
An asterisk (*) can be placed at the end of a synonym in the
configuration file. This indicates that the synonym is a prefix. The
asterisk is ignored when the entry is used in to_tsvector(), but when
it is used in to_tsquery(), the result will be a query item with the
prefix match marker (see Section 12.3.2). For example, suppose we have
these entries in $SHAREDIR/tsearch_data/synonym_sample.syn:

postgres        pgsql
postgresql      pgsql
postgre pgsql
gogle   googl
indices index*

Then we will get these results:
mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
mydb=# SELECT ts_lexize('syn', 'indices');
 ts_lexize
-----------
 {index}
(1 row)

mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
mydb=# SELECT to_tsvector('tst', 'indices');
 to_tsvector
-------------
 'index':1
(1 row)

mydb=# SELECT to_tsquery('tst', 'indices');
 to_tsquery
------------
 'index':*
(1 row)

mydb=# SELECT 'indexes are very useful'::tsvector;
            tsvector
---------------------------------
 'are' 'indexes' 'useful' 'very'
(1 row)

mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst', 'indices');
 ?column?
----------
 t
(1 row)
12.6.4. Thesaurus Dictionary
A thesaurus dictionary (sometimes abbreviated as TZ) is a collection of
words that includes information about the relationships of words and
phrases, i.e., broader terms (BT), narrower terms (NT), preferred
terms, non-preferred terms, related terms, etc.
Basically a thesaurus dictionary replaces all non-preferred terms by
one preferred term and, optionally, preserves the original terms for
indexing as well. PostgreSQL's current implementation of the thesaurus
dictionary is an extension of the synonym dictionary with added phrase
support. A thesaurus dictionary requires a configuration file of the
following format:

sample word(s) : indexed word(s)
more sample word(s) : more indexed word(s)
...

where the colon (:) symbol acts as a delimiter between a phrase and its
replacement.
A thesaurus dictionary uses a subdictionary (which is specified in the
dictionary's configuration) to normalize the input text before checking
for phrase matches. It is only possible to select one subdictionary. An
error is reported if the subdictionary fails to recognize a word. In
that case, you should remove the use of the word or teach the
subdictionary about it. You can place an asterisk (*) at the beginning
of an indexed word to skip applying the subdictionary to it, but all
sample words must be known to the subdictionary.
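
For example, in this hypothetical entry the indexed word is a code the
subdictionary would not recognize, so it is protected with a leading
asterisk:

blue giant : *bgiant

Here blue and giant must still be known to the subdictionary, but
bgiant is indexed as-is.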
The thesaurus dictionary chooses the longest match if there are
multiple phrases matching the input, and ties are broken by using the
last definition.
Specific stop words recognized by the subdictionary cannot be
specified; instead use ? to mark the location where any stop word can
appear. For example, assuming that a and the are stop words according
to the subdictionary:

? one ? two : swsw

matches a one the two and the one a two; both would be replaced by
swsw.
Since a thesaurus dictionary has the capability to recognize phrases it
must remember its state and interact with the parser. A thesaurus
dictionary uses these assignments to check if it should handle the next
word or stop accumulation. The thesaurus dictionary must be configured
carefully. For example, if the thesaurus dictionary is assigned to
handle only the asciiword token, then a thesaurus dictionary definition
like 'one 7' will not work since token type uint is not assigned to the
thesaurus dictionary.
Caution: Thesauruses are used during indexing so any change in the
thesaurus dictionary's parameters requires reindexing. For most other
dictionary types, small changes such as adding or removing stop words
do not force reindexing.
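
For example, after editing a thesaurus file and issuing ALTER TEXT
SEARCH DICTIONARY to reload it, any full-text index built with the
affected configuration should be rebuilt (documents_idx is a
hypothetical index name):

REINDEX INDEX documents_idx;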
12.6.4.1. Thesaurus Configuration
To define a new thesaurus dictionary, use the thesaurus template. For
example:

CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
    TEMPLATE = thesaurus,
    DictFile = mythesaurus,
    Dictionary = pg_catalog.english_stem
);

Here:
* thesaurus_simple is the new dictionary's name
* mythesaurus is the base name of the thesaurus configuration file.
  (Its full name will be $SHAREDIR/tsearch_data/mythesaurus.ths,
  where $SHAREDIR means the installation shared-data directory.)
* pg_catalog.english_stem is the subdictionary (here, a Snowball
  English stemmer) to use for thesaurus normalization. Notice that
  the subdictionary will have its own configuration (for example,
  stop words), which is not shown here.
Now it is possible to bind the thesaurus dictionary thesaurus_simple to
the desired token types in a configuration, for example:

ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH thesaurus_simple;
12.6.4.2. Thesaurus Example
Consider a simple astronomical thesaurus thesaurus_astro, which
contains some astronomical word combinations:

supernovae stars : sn
crab nebulae : crab
Below we create a dictionary and bind some token types to an
astronomical thesaurus and English stemmer:

CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
    TEMPLATE = thesaurus,
    DictFile = thesaurus_astro,
    Dictionary = english_stem
);

ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH thesaurus_astro, english_stem;
Now we can see how it works. ts_lexize is not very useful for testing a
thesaurus, because it treats its input as a single token. Instead we
can use plainto_tsquery and to_tsvector, which will break their input
strings into multiple tokens:

SELECT plainto_tsquery('supernova star');
 plainto_tsquery
-----------------
 'sn'

SELECT to_tsvector('supernova star');
 to_tsvector
-------------
 'sn':1

In principle, you can use to_tsquery if you quote the argument:

SELECT to_tsquery('''supernova star''');
 to_tsquery
------------
 'sn'
Notice that supernova star matches supernovae stars in thesaurus_astro
because we specified the english_stem stemmer in the thesaurus
definition. The stemmer removed the e and s.

To index the original phrase as well as the substitute, just include it
in the right-hand part of the definition:

supernovae stars : sn supernovae stars
SELECT plainto_tsquery('supernova star');
       plainto_tsquery
-----------------------------
 'sn' & 'supernova' & 'star'
12.6.5. Ispell Dictionary
The Ispell dictionary template supports morphological dictionaries,
which can normalize many different linguistic forms of a word into the
same lexeme. For example, an English Ispell dictionary can match all
declensions and conjugations of the search term bank, e.g., banking,
banked, banks, banks', and bank's.
The standard PostgreSQL distribution does not include any Ispell
configuration files. Dictionaries for a large number of languages are
available from Ispell. Also, some more modern dictionary file formats
are supported — MySpell (OO < 2.0.1) and Hunspell (OO >= 2.0.2). A
large list of dictionaries is available on the OpenOffice Wiki.
To create an Ispell dictionary perform these steps:

* download dictionary configuration files. OpenOffice extension files
  have the .oxt extension. It is necessary to extract .aff and .dic
  files, change extensions to .affix and .dict. For some dictionary
  files it is also needed to convert characters to the UTF-8 encoding
  with commands (for example, for a Norwegian language dictionary):

  iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
  iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic

* copy files to the $SHAREDIR/tsearch_data directory
* load files into PostgreSQL with the following command:

  CREATE TEXT SEARCH DICTIONARY english_hunspell (
      TEMPLATE = ispell,
      DictFile = en_us,
      AffFile = en_us,
      Stopwords = english);
Here, DictFile, AffFile, and StopWords specify the base names of the
dictionary, affixes, and stop-words files. The stop-words file has the
same format explained above for the simple dictionary type. The format
of the other files is not specified here but is available from the
above-mentioned web sites.
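
Once created, the dictionary can be tested with ts_lexize. The exact
output depends on the dictionary files installed, but with typical
en_us files one would expect something like:

SELECT ts_lexize('english_hunspell', 'banked');
 ts_lexize
-----------
 {bank}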
Ispell dictionaries usually recognize a limited set of words, so they
should be followed by another broader dictionary; for example, a
Snowball dictionary, which recognizes everything.
The .affix file of Ispell has the following structure:

prefixes
flag *A:
    .           >   RE      # As in enter > reenter
suffixes
flag T:
    E           >   ST      # As in late > latest
    [^AEIOU]Y   >   -Y,IEST # As in dirty > dirtiest
    [AEIOU]Y    >   EST     # As in gray > grayest
    [^EY]       >   EST     # As in small > smallest
And the .dict file has the following structure:

lapse/ADGRS
lard/DGRS
large/PRTY
lark/MRS

Format of the .dict file is:

basic_form/affix_class_name

In the .affix file every affix flag is described in the following
format:

condition > [-stripping_letters,] adding_affix
Here, condition has a format similar to the format of regular
expressions. It can use groupings [...] and [^...]. For example,
[AEIOU]Y means that the last letter of the word is "y" and the
penultimate letter is "a", "e", "i", "o" or "u". [^EY] means that the
last letter is neither "e" nor "y".
Ispell dictionaries support splitting compound words, a useful feature.
Notice that the affix file should specify a special flag using the
compoundwords controlled statement that marks dictionary words that can
participate in compound formation:

compoundwords controlled z
Here are some examples for the Norwegian language:

SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
   {over,buljong,terning,pakk,mester,assistent}
SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
   {sjokoladefabrikk,sjokolade,fabrikk}
MySpell format is a subset of Hunspell. The .affix file of Hunspell has
the following structure:

PFX A Y 1
PFX A   0     re         .
SFX T N 4
SFX T   0     st         e
SFX T   y     iest       [^aeiou]y
SFX T   0     est        [aeiou]y
SFX T   0     est        [^ey]

The first line of an affix class is the header. Fields of affix rules
are listed after the header:

* parameter name (PFX or SFX)
* flag (name of the affix class)
* stripping characters from beginning (at prefix) or end (at suffix)
* adding affix
* condition that has a format similar to the format of regular
  expressions.
The .dict file looks like the .dict file of Ispell:

larder/M
lardy/RT
large/RSPMYT
largehearted

MySpell does not support compound words. Hunspell has sophisticated
support for compound words. At present, PostgreSQL implements only the
basic compound word operations of Hunspell.
12.6.6. Snowball Dictionary
The Snowball dictionary template is based on a project by Martin
Porter, inventor of the popular Porter's stemming algorithm for the
English language. Snowball now provides stemming algorithms for many
languages (see the Snowball site for more information). Each algorithm
understands how to reduce common variant forms of words to a base, or
stem, spelling within its language. A Snowball dictionary requires a
language parameter to identify which stemmer to use, and optionally can
specify a stopword file name that gives a list of words to eliminate.
(PostgreSQL's standard stopword lists are also provided by the Snowball
project.) For example, there is a built-in definition equivalent to

CREATE TEXT SEARCH DICTIONARY english_stem (
    TEMPLATE = snowball,
    Language = english,
    StopWords = english
);

The stopword file format is the same as already explained.
A Snowball dictionary recognizes everything, whether or not it is able
to simplify the word, so it should be placed at the end of the
dictionary list. It is useless to have it before any other dictionary
because a token will never pass through it to the next dictionary.
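
For example, a configuration could put an Ispell dictionary first and
let the stemmer catch whatever Ispell does not recognize (this assumes
english_ispell has been created as in Section 12.6.5):

ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR asciiword
    WITH english_ispell, english_stem;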