<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>12.6. Dictionaries</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="textsearch-parsers.html" title="12.5. Parsers" /><link rel="next" href="textsearch-configuration.html" title="12.7. Configuration Example" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">12.6. Dictionaries</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="textsearch-parsers.html" title="12.5. Parsers">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><th width="60%" align="center">Chapter 12. Full Text Search</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="textsearch-configuration.html" title="12.7. Configuration Example">Next</a></td></tr></table><hr /></div><div class="sect1" id="TEXTSEARCH-DICTIONARIES"><div class="titlepage"><div><div><h2 class="title" style="clear: both">12.6. Dictionaries <a href="#TEXTSEARCH-DICTIONARIES" class="id_link">#</a></h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS">12.6.1. Stop Words</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY">12.6.2. Simple Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-SYNONYM-DICTIONARY">12.6.3. 
Synonym Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-THESAURUS">12.6.4. Thesaurus Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY">12.6.5. <span class="application">Ispell</span> Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-SNOWBALL-DICTIONARY">12.6.6. <span class="application">Snowball</span> Dictionary</a></span></dt></dl></div><p>
Dictionaries are used to eliminate words that should not be considered in a
search (<em class="firstterm">stop words</em>), and to <em class="firstterm">normalize</em> words so
that different derived forms of the same word will match. A successfully
normalized word is called a <em class="firstterm">lexeme</em>. Aside from
improving search quality, normalization and removal of stop words reduce the
size of the <code class="type">tsvector</code> representation of a document, thereby
improving performance. Normalization does not always have linguistic meaning
and usually depends on application semantics.
</p><p>
Some examples of normalization:

</p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
Linguistic: Ispell dictionaries try to reduce input words to a
normalized form; stemmer dictionaries remove word endings
</p></li><li class="listitem" style="list-style-type: disc"><p>
<acronym class="acronym">URL</acronym> locations can be canonicalized to make
equivalent URLs match:

</p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
http://www.pgsql.ru/db/mw/index.html
</p></li><li class="listitem" style="list-style-type: disc"><p>
http://www.pgsql.ru/db/mw/
</p></li><li class="listitem" style="list-style-type: disc"><p>
http://www.pgsql.ru/db/../db/mw/index.html
</p></li></ul></div><p>
</p></li><li class="listitem" style="list-style-type: disc"><p>
Color names can be replaced by their hexadecimal values, e.g.,
<code class="literal">red, green, blue, magenta -&gt; FF0000, 00FF00, 0000FF, FF00FF</code>
</p></li><li class="listitem" style="list-style-type: disc"><p>
If indexing numbers, we can
remove some fractional digits to reduce the range of possible
numbers, so for example <span class="emphasis"><em>3.14</em></span>159265359,
<span class="emphasis"><em>3.14</em></span>15926, <span class="emphasis"><em>3.14</em></span> will be the same
after normalization if only two digits are kept after the decimal point.
</p></li></ul></div><p>
A dictionary is a program that accepts a token as
input and returns:

</p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
an array of lexemes if the input token is known to the dictionary
(notice that one token can produce more than one lexeme)
</p></li><li class="listitem" style="list-style-type: disc"><p>
a single lexeme with the <code class="literal">TSL_FILTER</code> flag set, to replace
the original token with a new token to be passed to subsequent
dictionaries (a dictionary that does this is called a
<em class="firstterm">filtering dictionary</em>)
</p></li><li class="listitem" style="list-style-type: disc"><p>
an empty array if the dictionary knows the token, but it is a stop word
</p></li><li class="listitem" style="list-style-type: disc"><p>
<code class="literal">NULL</code> if the dictionary does not recognize the input token
</p></li></ul></div><p>
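Some of these outcomes can be observed from SQL with <code class="function">ts_lexize</code>.
The following is a minimal sketch, assuming the built-in <code class="literal">english_stem</code>
dictionary; the column aliases are illustrative names only. An empty array and
<code class="literal">NULL</code> are distinct results, so the third column tests for
<code class="literal">NULL</code> explicitly:
</p><pre class="programlisting">
-- Illustrative sketch: the column aliases are hypothetical names.
SELECT ts_lexize('english_stem', 'stars')       AS known_word,
       ts_lexize('english_stem', 'the')         AS stop_word,
       ts_lexize('english_stem', 'the') IS NULL AS unrecognized;
</pre><p>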
<span class="productname">PostgreSQL</span> provides predefined dictionaries for
many languages. There are also several predefined templates that can be
used to create new dictionaries with custom parameters. Each predefined
dictionary template is described below. If no existing
template is suitable, it is possible to create new ones; see the
<code class="filename">contrib/</code> area of the <span class="productname">PostgreSQL</span> distribution
for examples.
</p><p>
A text search configuration binds a parser together with a set of
dictionaries to process the parser's output tokens. For each token
type that the parser can return, a separate list of dictionaries is
specified by the configuration. When a token of that type is found
by the parser, each dictionary in the list is consulted in turn,
until some dictionary recognizes it as a known word. If it is identified
as a stop word, or if no dictionary recognizes the token, it will be
discarded and not indexed or searched for.
Normally, the first dictionary that returns a non-<code class="literal">NULL</code>
output determines the result, and any remaining dictionaries are not
consulted; but a filtering dictionary can replace the given word
with a modified word, which is then passed to subsequent dictionaries.
</p><p>
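To see which dictionary actually handled each token, one can inspect the output of
<code class="function">ts_debug</code>. The following is a minimal sketch, assuming the built-in
<code class="literal">english</code> configuration and an arbitrary sample sentence:
</p><pre class="programlisting">
-- "dictionaries" lists the dictionaries consulted for the token type;
-- "dictionary" names the one that recognized the token (NULL if none did).
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('english', 'The astronomers observed two supernovae');
</pre><p>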
The general rule for configuring a list of dictionaries
is to place the narrowest, most specific dictionary first, then the more
general dictionaries, finishing with a very general dictionary, like
a <span class="application">Snowball</span> stemmer or <code class="literal">simple</code>, which
recognizes everything. For example, for an astronomy-specific search
(<code class="literal">astro_en</code> configuration) one could bind token type
<code class="type">asciiword</code> (ASCII word) to a synonym dictionary of astronomical
terms, a general English dictionary and a <span class="application">Snowball</span> English
stemmer:

</p><pre class="programlisting">
ALTER TEXT SEARCH CONFIGURATION astro_en
    ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
</pre><p>
A filtering dictionary can be placed anywhere in the list, except at the
end where it'd be useless. Filtering dictionaries are useful to partially
normalize words to simplify the task of later dictionaries. For example,
a filtering dictionary could be used to remove accents from accented
letters, as is done by the <a class="xref" href="unaccent.html" title="F.48. unaccent — a text search dictionary which removes diacritics">unaccent</a> module.
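</p><p>
As a sketch of this arrangement, the following assumes that the
<code class="literal">unaccent</code> extension is available for installation and uses a
hypothetical configuration name, <code class="literal">fr_unaccent</code>. The filtering
dictionary is listed first so that the stemmer sees the unaccented token:
</p><pre class="programlisting">
-- Assumes the unaccent extension can be installed; fr_unaccent is a hypothetical name.
CREATE EXTENSION IF NOT EXISTS unaccent;
CREATE TEXT SEARCH CONFIGURATION fr_unaccent ( COPY = french );
ALTER TEXT SEARCH CONFIGURATION fr_unaccent
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, french_stem;

SELECT to_tsvector('fr_unaccent', 'Hôtels de la Mer');
</pre><p>
With such a mapping, accented and unaccented spellings of a word are expected to yield the same lexeme.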
</p><div class="sect2" id="TEXTSEARCH-STOPWORDS"><div class="titlepage"><div><div><h3 class="title">12.6.1. Stop Words <a href="#TEXTSEARCH-STOPWORDS" class="id_link">#</a></h3></div></div></div><p>
Stop words are words that are very common, appear in almost every
document, and have no discrimination value. Therefore, they can be ignored
in the context of full text searching. For example, every English text
contains words like <code class="literal">a</code> and <code class="literal">the</code>, so it is
useless to store them in an index. However, stop words do affect the
positions in <code class="type">tsvector</code>, which in turn affect ranking:

</p><pre class="screen">
SELECT to_tsvector('english', 'in the list of stop words');
        to_tsvector
----------------------------
 'list':3 'stop':5 'word':6
</pre><p>

The positions 1, 2, and 4 are missing because of stop words. Ranks
calculated for documents with and without stop words are quite different:

</p><pre class="screen">
SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list &amp; stop'));
 ts_rank_cd
------------
       0.05

SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list &amp; stop'));
 ts_rank_cd
------------
        0.1
</pre><p>
It is up to the specific dictionary how it treats stop words. For example,
<code class="literal">ispell</code> dictionaries first normalize words and then
look at the list of stop words, while <code class="literal">Snowball</code> stemmers
first check the list of stop words. The reason for the different
behavior is an attempt to decrease noise.
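</p><p>
To compare positions with and without stop-word removal, the same text can be run through a
configuration that removes no stop words. The following is a minimal sketch, assuming the built-in
<code class="literal">simple</code> configuration, whose dictionary has no stop-word list by default:
</p><pre class="programlisting">
-- english drops stop words but keeps the position gaps they leave behind.
SELECT to_tsvector('english', 'in the list of stop words');
-- simple keeps every word, so all six positions are occupied.
SELECT to_tsvector('simple', 'in the list of stop words');
</pre><p>
The second result contains more entries, which illustrates the size cost of keeping stop words.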
</p></div><div class="sect2" id="TEXTSEARCH-SIMPLE-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.2. Simple Dictionary <a href="#TEXTSEARCH-SIMPLE-DICTIONARY" class="id_link">#</a></h3></div></div></div><p>
The <code class="literal">simple</code> dictionary template operates by converting the
input token to lower case and checking it against a file of stop words.
If it is found in the file then an empty array is returned, causing
the token to be discarded. If not, the lower-cased form of the word
is returned as the normalized lexeme. Alternatively, the dictionary
can be configured to report non-stop-words as unrecognized, allowing
them to be passed on to the next dictionary in the list.
</p><p>
Here is an example of a dictionary definition using the <code class="literal">simple</code>
template:

</p><pre class="programlisting">
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);
</pre><p>
Here, <code class="literal">english</code> is the base name of a file of stop words.
The file's full name will be
<code class="filename">$SHAREDIR/tsearch_data/english.stop</code>,
where <code class="literal">$SHAREDIR</code> means the
<span class="productname">PostgreSQL</span> installation's shared-data directory,
often <code class="filename">/usr/local/share/postgresql</code> (use <code class="command">pg_config
--sharedir</code> to determine it if you're not sure).
The file format is simply a list
of words, one per line. Blank lines and trailing spaces are ignored,
and upper case is folded to lower case, but no other processing is done
on the file contents.
</p><p>
Now we can test our dictionary:

</p><pre class="screen">
SELECT ts_lexize('public.simple_dict', 'YeS');
 ts_lexize
-----------
 {yes}

SELECT ts_lexize('public.simple_dict', 'The');
 ts_lexize
-----------
 {}
</pre><p>
We can also choose to return <code class="literal">NULL</code>, instead of the lower-cased
word, if it is not found in the stop words file. This behavior is
selected by setting the dictionary's <code class="literal">Accept</code> parameter to
<code class="literal">false</code>. Continuing the example:

</p><pre class="screen">
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );

SELECT ts_lexize('public.simple_dict', 'YeS');
 ts_lexize
-----------


SELECT ts_lexize('public.simple_dict', 'The');
 ts_lexize
-----------
 {}
</pre><p>
With the default setting of <code class="literal">Accept</code> = <code class="literal">true</code>,
it is only useful to place a <code class="literal">simple</code> dictionary at the end
of a list of dictionaries, since it will never pass on any token to
a following dictionary. Conversely, <code class="literal">Accept</code> = <code class="literal">false</code>
is only useful when there is at least one following dictionary.
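</p><p>
A sketch of the second arrangement follows. The names <code class="literal">public.stop_only</code> and
<code class="literal">public.stop_then_stem</code> are hypothetical; the dictionary discards stop words
itself and reports everything else as unrecognized, so that the stemmer handles the remaining words:
</p><pre class="programlisting">
-- Hypothetical names; Accept = false makes non-stop-words fall through.
CREATE TEXT SEARCH DICTIONARY public.stop_only (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english,
    Accept = false
);
CREATE TEXT SEARCH CONFIGURATION public.stop_then_stem ( COPY = pg_catalog.english );
ALTER TEXT SEARCH CONFIGURATION public.stop_then_stem
    ALTER MAPPING FOR asciiword
    WITH public.stop_only, pg_catalog.english_stem;

SELECT * FROM ts_debug('public.stop_then_stem', 'The Stars');
</pre><p>
In the <code class="function">ts_debug</code> output, the <code class="literal">dictionary</code> column shows
which of the two dictionaries handled each word.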
</p><div class="caution"><h3 class="title">Caution</h3><p>
Most types of dictionaries rely on configuration files, such as files of
stop words. These files <span class="emphasis"><em>must</em></span> be stored in UTF-8 encoding.
They will be translated to the actual database encoding, if that is
different, when they are read into the server.
</p></div><div class="caution"><h3 class="title">Caution</h3><p>
Normally, a database session will read a dictionary configuration file
only once, when it is first used within the session. If you modify a
configuration file and want to force existing sessions to pick up the
new contents, issue an <code class="command">ALTER TEXT SEARCH DICTIONARY</code> command
on the dictionary. This can be a <span class="quote">“<span class="quote">dummy</span>”</span> update that doesn't
actually change any parameter values.
</p></div></div><div class="sect2" id="TEXTSEARCH-SYNONYM-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.3. Synonym Dictionary <a href="#TEXTSEARCH-SYNONYM-DICTIONARY" class="id_link">#</a></h3></div></div></div><p>
This dictionary template is used to create dictionaries that replace a
word with a synonym. Phrases are not supported (use the thesaurus
template (<a class="xref" href="textsearch-dictionaries.html#TEXTSEARCH-THESAURUS" title="12.6.4. Thesaurus Dictionary">Section 12.6.4</a>) for that). A synonym
dictionary can be used to overcome linguistic problems, for example, to
prevent an English stemmer dictionary from reducing the word <span class="quote">“<span class="quote">Paris</span>”</span> to
<span class="quote">“<span class="quote">pari</span>”</span>. It is enough to have a <code class="literal">Paris paris</code> line in the
synonym dictionary and put it before the <code class="literal">english_stem</code>
dictionary. For example:

</p><pre class="screen">
SELECT * FROM ts_debug('english', 'Paris');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari}

CREATE TEXT SEARCH DICTIONARY my_synonym (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms
);

ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR asciiword
    WITH my_synonym, english_stem;

SELECT * FROM ts_debug('english', 'Paris');
   alias   |   description   | token |       dictionaries        | dictionary | lexemes
-----------+-----------------+-------+---------------------------+------------+---------
 asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
</pre><p>
The only parameter required by the <code class="literal">synonym</code> template is
<code class="literal">SYNONYMS</code>, which is the base name of its configuration file
(<code class="literal">my_synonyms</code> in the above example).
The file's full name will be
<code class="filename">$SHAREDIR/tsearch_data/my_synonyms.syn</code>
(where <code class="literal">$SHAREDIR</code> means the
<span class="productname">PostgreSQL</span> installation's shared-data directory).
The file format is just one line
per word to be substituted, with the word followed by its synonym,
separated by white space. Blank lines and trailing spaces are ignored.
</p><p>
The <code class="literal">synonym</code> template also has an optional parameter
<code class="literal">CaseSensitive</code>, which defaults to <code class="literal">false</code>. When
<code class="literal">CaseSensitive</code> is <code class="literal">false</code>, words in the synonym file
are folded to lower case, as are input tokens. When it is
<code class="literal">true</code>, words and tokens are not folded to lower case,
but are compared as-is.
</p><p>
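For instance, a case-sensitive variant of the dictionary above could be declared as follows.
This is a sketch: the dictionary name <code class="literal">my_synonym_cs</code> is hypothetical, and it
assumes that the <code class="filename">my_synonyms.syn</code> file from the earlier example exists:
</p><pre class="programlisting">
-- Hypothetical dictionary name; reuses the my_synonyms file from above.
CREATE TEXT SEARCH DICTIONARY my_synonym_cs (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms,
    CaseSensitive = true
);
-- With CaseSensitive = true the two inputs are looked up as-is,
-- so each matches only an entry spelled with the same case.
SELECT ts_lexize('my_synonym_cs', 'Paris') AS capitalized,
       ts_lexize('my_synonym_cs', 'paris') AS lower_case;
</pre><p>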
An asterisk (<code class="literal">*</code>) can be placed at the end of a synonym
in the configuration file. This indicates that the synonym is a prefix.
The asterisk is ignored when the entry is used in
<code class="function">to_tsvector()</code>, but when it is used in
<code class="function">to_tsquery()</code>, the result will be a query item with
the prefix match marker (see
<a class="xref" href="textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES" title="12.3.2. Parsing Queries">Section 12.3.2</a>).
For example, suppose we have these entries in
<code class="filename">$SHAREDIR/tsearch_data/synonym_sample.syn</code>:
</p><pre class="programlisting">
postgres        pgsql
postgresql      pgsql
postgre pgsql
gogle   googl
indices index*
</pre><p>
Then we will get these results:
</p><pre class="screen">
mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
mydb=# SELECT ts_lexize('syn', 'indices');
 ts_lexize
-----------
 {index}
(1 row)

mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
mydb=# SELECT to_tsvector('tst', 'indices');
 to_tsvector
-------------
 'index':1
(1 row)

mydb=# SELECT to_tsquery('tst', 'indices');
 to_tsquery
------------
 'index':*
(1 row)

mydb=# SELECT 'indexes are very useful'::tsvector;
            tsvector
---------------------------------
 'are' 'indexes' 'useful' 'very'
(1 row)

mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst', 'indices');
 ?column?
----------
 t
(1 row)
</pre><p>
</p></div><div class="sect2" id="TEXTSEARCH-THESAURUS"><div class="titlepage"><div><div><h3 class="title">12.6.4. Thesaurus Dictionary <a href="#TEXTSEARCH-THESAURUS" class="id_link">#</a></h3></div></div></div><p>
A thesaurus dictionary (sometimes abbreviated as <acronym class="acronym">TZ</acronym>) is
a collection of words that includes information about the relationships
of words and phrases, i.e., broader terms (<acronym class="acronym">BT</acronym>), narrower
terms (<acronym class="acronym">NT</acronym>), preferred terms, non-preferred terms, related
terms, etc.
</p><p>
Basically a thesaurus dictionary replaces all non-preferred terms by one
preferred term and, optionally, preserves the original terms for indexing
as well. <span class="productname">PostgreSQL</span>'s current implementation of the
thesaurus dictionary is an extension of the synonym dictionary with added
<em class="firstterm">phrase</em> support. A thesaurus dictionary requires
a configuration file of the following format:

</p><pre class="programlisting">
# this is a comment
sample word(s) : indexed word(s)
more sample word(s) : more indexed word(s)
...
</pre><p>

where the colon (<code class="symbol">:</code>) symbol acts as a delimiter between a
phrase and its replacement.
</p><p>
A thesaurus dictionary uses a <em class="firstterm">subdictionary</em> (which
is specified in the dictionary's configuration) to normalize the input
text before checking for phrase matches. It is only possible to select one
subdictionary. An error is reported if the subdictionary fails to
recognize a word. In that case, you should remove the use of the word or
teach the subdictionary about it. You can place an asterisk
(<code class="symbol">*</code>) at the beginning of an indexed word to skip applying
the subdictionary to it, but all sample words <span class="emphasis"><em>must</em></span> be known
to the subdictionary.
</p><p>
The thesaurus dictionary chooses the longest match if there are multiple
phrases matching the input, and ties are broken by using the last
definition.
</p><p>
Specific stop words recognized by the subdictionary cannot be
specified; instead use <code class="literal">?</code> to mark the location where any
stop word can appear. For example, assuming that <code class="literal">a</code> and
<code class="literal">the</code> are stop words according to the subdictionary:

</p><pre class="programlisting">
? one ? two : swsw
</pre><p>

matches <code class="literal">a one the two</code> and <code class="literal">the one a two</code>;
both would be replaced by <code class="literal">swsw</code>.
</p><p>
Since a thesaurus dictionary has the capability to recognize phrases, it
must remember its state and interact with the parser. It uses the
token-type assignments in the configuration to check whether it should
handle the next word or stop accumulation. The thesaurus dictionary must
therefore be configured
carefully. For example, if the thesaurus dictionary is assigned to handle
only the <code class="literal">asciiword</code> token, then a thesaurus dictionary
definition like <code class="literal">one 7</code> will not work since token type
<code class="literal">uint</code> is not assigned to the thesaurus dictionary.
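</p><p>
One way to make such a definition usable is to assign the thesaurus to every token type that its
sample phrases can produce. The following is a sketch, assuming a thesaurus dictionary named
<code class="literal">thesaurus_simple</code> already exists (one is created in the configuration
example in this section) and using a hypothetical configuration name, <code class="literal">ths_demo</code>:
</p><pre class="programlisting">
-- ths_demo is a hypothetical configuration name.
CREATE TEXT SEARCH CONFIGURATION public.ths_demo ( COPY = pg_catalog.english );
ALTER TEXT SEARCH CONFIGURATION public.ths_demo
    ALTER MAPPING FOR asciiword
    WITH thesaurus_simple, english_stem;
-- uint covers tokens such as 7, so a phrase like "one 7" can be matched.
ALTER TEXT SEARCH CONFIGURATION public.ths_demo
    ALTER MAPPING FOR uint
    WITH thesaurus_simple, simple;
</pre><p>
Each token type keeps its own fallback dictionary after the thesaurus.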
</p><div class="caution"><h3 class="title">Caution</h3><p>
Thesauruses are used during indexing so any change in the thesaurus
dictionary's parameters <span class="emphasis"><em>requires</em></span> reindexing.
For most other dictionary types, small changes such as adding or
removing stop words do not force reindexing.
</p></div><div class="sect3" id="TEXTSEARCH-THESAURUS-CONFIG"><div class="titlepage"><div><div><h4 class="title">12.6.4.1. Thesaurus Configuration <a href="#TEXTSEARCH-THESAURUS-CONFIG" class="id_link">#</a></h4></div></div></div><p>
To define a new thesaurus dictionary, use the <code class="literal">thesaurus</code>
template. For example:

</p><pre class="programlisting">
CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
    TEMPLATE = thesaurus,
    DictFile = mythesaurus,
    Dictionary = pg_catalog.english_stem
);
</pre><p>

Here:
</p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
<code class="literal">thesaurus_simple</code> is the new dictionary's name
</p></li><li class="listitem" style="list-style-type: disc"><p>
<code class="literal">mythesaurus</code> is the base name of the thesaurus
configuration file.
(Its full name will be <code class="filename">$SHAREDIR/tsearch_data/mythesaurus.ths</code>,
where <code class="literal">$SHAREDIR</code> means the installation shared-data
directory.)
</p></li><li class="listitem" style="list-style-type: disc"><p>
<code class="literal">pg_catalog.english_stem</code> is the subdictionary (here,
a Snowball English stemmer) to use for thesaurus normalization.
Notice that the subdictionary will have its own
configuration (for example, stop words), which is not shown here.
</p></li></ul></div><p>
Now it is possible to bind the thesaurus dictionary <code class="literal">thesaurus_simple</code>
to the desired token types in a configuration, for example:

</p><pre class="programlisting">
ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH thesaurus_simple;
</pre><p>
</p></div><div class="sect3" id="TEXTSEARCH-THESAURUS-EXAMPLES"><div class="titlepage"><div><div><h4 class="title">12.6.4.2. Thesaurus Example <a href="#TEXTSEARCH-THESAURUS-EXAMPLES" class="id_link">#</a></h4></div></div></div><p>
Consider a simple astronomical thesaurus <code class="literal">thesaurus_astro</code>,
which contains some astronomical word combinations:

</p><pre class="programlisting">
supernovae stars : sn
crab nebulae : crab
</pre><p>

Below we create a dictionary and bind some token types to
an astronomical thesaurus and English stemmer:

</p><pre class="programlisting">
CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
    TEMPLATE = thesaurus,
    DictFile = thesaurus_astro,
    Dictionary = english_stem
);

ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH thesaurus_astro, english_stem;
</pre><p>
Now we can see how it works.
<code class="function">ts_lexize</code> is not very useful for testing a thesaurus,
because it treats its input as a single token. Instead we can use
<code class="function">plainto_tsquery</code> and <code class="function">to_tsvector</code>
which will break their input strings into multiple tokens:

</p><pre class="screen">
SELECT plainto_tsquery('supernova star');
 plainto_tsquery
-----------------
 'sn'

SELECT to_tsvector('supernova star');
 to_tsvector
-------------
 'sn':1
</pre><p>

In principle, one can use <code class="function">to_tsquery</code> if the argument
is quoted:

</p><pre class="screen">
SELECT to_tsquery('''supernova star''');
 to_tsquery
------------
 'sn'
</pre><p>

Notice that <code class="literal">supernova star</code> matches <code class="literal">supernovae
stars</code> in <code class="literal">thesaurus_astro</code> because we specified
the <code class="literal">english_stem</code> stemmer in the thesaurus definition.
The stemmer removed the <code class="literal">e</code> and <code class="literal">s</code>.
</p><p>
To index the original phrase as well as the substitute, just include it
in the right-hand part of the definition:

</p><pre class="screen">
supernovae stars : sn supernovae stars

SELECT plainto_tsquery('supernova star');
       plainto_tsquery
-----------------------------
 'sn' &amp; 'supernova' &amp; 'star'
</pre><p>
</p></div></div><div class="sect2" id="TEXTSEARCH-ISPELL-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.5. <span class="application">Ispell</span> Dictionary <a href="#TEXTSEARCH-ISPELL-DICTIONARY" class="id_link">#</a></h3></div></div></div><p>
The <span class="application">Ispell</span> dictionary template supports
<em class="firstterm">morphological dictionaries</em>, which can normalize many
different linguistic forms of a word into the same lexeme. For example,
an English <span class="application">Ispell</span> dictionary can match all declensions and
conjugations of the search term <code class="literal">bank</code>, e.g.,
<code class="literal">banking</code>, <code class="literal">banked</code>, <code class="literal">banks</code>,
<code class="literal">banks'</code>, and <code class="literal">bank's</code>.
</p><p>
The standard <span class="productname">PostgreSQL</span> distribution does
not include any <span class="application">Ispell</span> configuration files.
Dictionaries for a large number of languages are available from <a class="ulink" href="https://www.cs.hmc.edu/~geoff/ispell.html" target="_top">Ispell</a>.
Also, some more modern dictionary file formats are supported: <a class="ulink" href="https://en.wikipedia.org/wiki/MySpell" target="_top">MySpell</a> (OO &lt; 2.0.1)
and <a class="ulink" href="https://hunspell.github.io/" target="_top">Hunspell</a>
(OO &gt;= 2.0.2). A large list of dictionaries is available on the <a class="ulink" href="https://wiki.openoffice.org/wiki/Dictionaries" target="_top">OpenOffice
Wiki</a>.
</p><p>
To create an <span class="application">Ispell</span> dictionary perform these steps:
</p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
download dictionary configuration files. <span class="productname">OpenOffice</span>
extension files have the <code class="filename">.oxt</code> extension. It is necessary
to extract the <code class="filename">.aff</code> and <code class="filename">.dic</code> files and change
their extensions to <code class="filename">.affix</code> and <code class="filename">.dict</code>. For some
dictionary files it is also necessary to convert characters to the UTF-8
encoding with commands (for example, for a Norwegian language dictionary):
</p><pre class="programlisting">
iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
</pre><p>
</p></li><li class="listitem" style="list-style-type: disc"><p>
copy files to the <code class="filename">$SHAREDIR/tsearch_data</code> directory
</p></li><li class="listitem" style="list-style-type: disc"><p>
load files into PostgreSQL with the following command:
</p><pre class="programlisting">
CREATE TEXT SEARCH DICTIONARY english_hunspell (
    TEMPLATE = ispell,
    DictFile = en_us,
    AffFile = en_us,
    Stopwords = english);
</pre><p>
</p></li></ul></div><p>
Here, <code class="literal">DictFile</code>, <code class="literal">AffFile</code>, and <code class="literal">StopWords</code>
specify the base names of the dictionary, affixes, and stop-words files.
The stop-words file has the same format explained above for the
<code class="literal">simple</code> dictionary type. The format of the other files is
not specified here but is available from the above-mentioned web sites.
</p><p>
Ispell dictionaries usually recognize a limited set of words, so they
should be followed by another broader dictionary; for
example, a Snowball dictionary, which recognizes everything.
</p><p>
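A sketch of such a chain follows, assuming the <code class="literal">english_hunspell</code> dictionary
created above and a hypothetical configuration name, <code class="literal">en_hunspell</code>:
</p><pre class="programlisting">
-- en_hunspell is a hypothetical configuration name.
CREATE TEXT SEARCH CONFIGURATION public.en_hunspell ( COPY = pg_catalog.english );
-- Words unknown to the Ispell dictionary fall through to the Snowball stemmer.
ALTER TEXT SEARCH CONFIGURATION public.en_hunspell
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH english_hunspell, english_stem;
</pre><p>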
The <code class="filename">.affix</code> file of <span class="application">Ispell</span> has the following
structure:
</p><pre class="programlisting">
prefixes
flag *A:
    .           &gt;   RE      # As in enter &gt; reenter
suffixes
flag T:
    E           &gt;   ST      # As in late &gt; latest
    [^AEIOU]Y   &gt;   -Y,IEST # As in dirty &gt; dirtiest
    [AEIOU]Y    &gt;   EST     # As in gray &gt; grayest
    [^EY]       &gt;   EST     # As in small &gt; smallest
</pre><p>
And the <code class="filename">.dict</code> file has the following structure:
</p><pre class="programlisting">
lapse/ADGRS
lard/DGRS
large/PRTY
lark/MRS
</pre><p>
</p><p>
Format of the <code class="filename">.dict</code> file is:
</p><pre class="programlisting">
basic_form/affix_class_name
</pre><p>
</p><p>
In the <code class="filename">.affix</code> file every affix flag is described in the
following format:
</p><pre class="programlisting">
condition &gt; [-stripping_letters,] adding_affix
</pre><p>
Here, condition has a format similar to the format of regular expressions.
It can use groupings <code class="literal">[...]</code> and <code class="literal">[^...]</code>.
For example, <code class="literal">[AEIOU]Y</code> means that the last letter of the word
is <code class="literal">"y"</code> and the penultimate letter is <code class="literal">"a"</code>,
<code class="literal">"e"</code>, <code class="literal">"i"</code>, <code class="literal">"o"</code> or <code class="literal">"u"</code>.
<code class="literal">[^EY]</code> means that the last letter is neither <code class="literal">"e"</code>
nor <code class="literal">"y"</code>.
</p><p>
Ispell dictionaries support splitting compound words;
a useful feature.
Notice that the affix file should specify a special flag using the
<code class="literal">compoundwords controlled</code> statement that marks dictionary
words that can participate in compound formation:

</p><pre class="programlisting">
compoundwords  controlled z
</pre><p>
Here are some examples for the Norwegian language:

</p><pre class="programlisting">
SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
   {over,buljong,terning,pakk,mester,assistent}
SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
   {sjokoladefabrikk,sjokolade,fabrikk}
</pre><p>
</p><p>
<span class="application">MySpell</span> format is a subset of <span class="application">Hunspell</span>.
The <code class="filename">.affix</code> file of <span class="application">Hunspell</span> has the following
structure:
</p><pre class="programlisting">
PFX A Y 1
PFX A   0     re         .
SFX T N 4
SFX T   0     st         e
SFX T   y     iest       [^aeiou]y
SFX T   0     est        [aeiou]y
SFX T   0     est        [^ey]
</pre><p>
The first line of an affix class is the header. Fields of an affix rule are
listed after the header:
</p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
parameter name (PFX or SFX)
</p></li><li class="listitem" style="list-style-type: disc"><p>
flag (name of the affix class)
</p></li><li class="listitem" style="list-style-type: disc"><p>
stripping characters from beginning (at prefix) or end (at suffix) of the
word
</p></li><li class="listitem" style="list-style-type: disc"><p>
adding affix
</p></li><li class="listitem" style="list-style-type: disc"><p>
condition that has a format similar to the format of regular expressions.
</p></li></ul></div><p>
The <code class="filename">.dict</code> file looks like the <code class="filename">.dict</code> file of
<span class="application">Ispell</span>:
</p><pre class="programlisting">
larder/M
lardy/RT
large/RSPMYT
largehearted
</pre><p>
</p><div class="note"><h3 class="title">Note</h3><p>
<span class="application">MySpell</span> does not support compound words.
<span class="application">Hunspell</span> has sophisticated support for compound words. At
present, <span class="productname">PostgreSQL</span> implements only the basic
compound word operations of Hunspell.
</p></div></div><div class="sect2" id="TEXTSEARCH-SNOWBALL-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.6. <span class="application">Snowball</span> Dictionary <a href="#TEXTSEARCH-SNOWBALL-DICTIONARY" class="id_link">#</a></h3></div></div></div><p>
The <span class="application">Snowball</span> dictionary template is based on a project
by Martin Porter, inventor of the popular Porter's stemming algorithm
for the English language. Snowball now provides stemming algorithms for
many languages (see the <a class="ulink" href="https://snowballstem.org/" target="_top">Snowball
site</a> for more information). Each algorithm understands how to
reduce common variant forms of words to a base, or stem, spelling within
its language. A Snowball dictionary requires a <code class="literal">language</code>
parameter to identify which stemmer to use, and optionally can specify a
<code class="literal">stopword</code> file name that gives a list of words to eliminate.
(<span class="productname">PostgreSQL</span>'s standard stopword lists are also
provided by the Snowball project.)
For example, there is a built-in definition equivalent to

</p><pre class="programlisting">
CREATE TEXT SEARCH DICTIONARY english_stem (
    TEMPLATE = snowball,
    Language = english,
    StopWords = english
);
</pre><p>

The stopword file format is the same as already explained.
</p><p>
A <span class="application">Snowball</span> dictionary recognizes everything, whether
or not it is able to simplify the word, so it should be placed
at the end of the dictionary list. It is useless to have it
before any other dictionary because a token will never pass through it to
the next dictionary.
</p></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="textsearch-parsers.html" title="12.5. Parsers">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="textsearch-configuration.html" title="12.7. Configuration Example">Next</a></td></tr><tr><td width="40%" align="left" valign="top">12.5. Parsers </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 12.7. Configuration Example</td></tr></table></div></body></html>