   1 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
   2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>12.6. Dictionaries</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="textsearch-parsers.html" title="12.5. Parsers" /><link rel="next" href="textsearch-configuration.html" title="12.7. Configuration Example" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">12.6. Dictionaries</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="textsearch-parsers.html" title="12.5. Parsers">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><th width="60%" align="center">Chapter 12. Full Text Search</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="textsearch-configuration.html" title="12.7. Configuration Example">Next</a></td></tr></table><hr /></div><div class="sect1" id="TEXTSEARCH-DICTIONARIES"><div class="titlepage"><div><div><h2 class="title" style="clear: both">12.6. Dictionaries <a href="#TEXTSEARCH-DICTIONARIES" class="id_link">#</a></h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS">12.6.1. Stop Words</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY">12.6.2. Simple Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-SYNONYM-DICTIONARY">12.6.3. 
Synonym Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-THESAURUS">12.6.4. Thesaurus Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY">12.6.5. <span class="application">Ispell</span> Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-SNOWBALL-DICTIONARY">12.6.6. <span class="application">Snowball</span> Dictionary</a></span></dt></dl></div><p>
   3    Dictionaries are used to eliminate words that should not be considered in a
   4    search (<em class="firstterm">stop words</em>), and to <em class="firstterm">normalize</em> words so
   5    that different derived forms of the same word will match.  A successfully
   6    normalized word is called a <em class="firstterm">lexeme</em>.  Aside from
   7    improving search quality, normalization and removal of stop words reduce the
   8    size of the <code class="type">tsvector</code> representation of a document, thereby
   9    improving performance.  Normalization does not always have linguistic meaning
  10    and usually depends on application semantics.
  11   </p><p>
  12    Some examples of normalization:
  13
  14    </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
  15       Linguistic — Ispell dictionaries try to reduce input words to a
  16       normalized form; stemmer dictionaries remove word endings
  17      </p></li><li class="listitem" style="list-style-type: disc"><p>
  18       <acronym class="acronym">URL</acronym> locations can be canonicalized to make
  19       equivalent URLs match:
  20
  21       </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
  22          http://www.pgsql.ru/db/mw/index.html
  23         </p></li><li class="listitem" style="list-style-type: disc"><p>
  24          http://www.pgsql.ru/db/mw/
  25         </p></li><li class="listitem" style="list-style-type: disc"><p>
  26          http://www.pgsql.ru/db/../db/mw/index.html
  27         </p></li></ul></div><p>
  28      </p></li><li class="listitem" style="list-style-type: disc"><p>
  29       Color names can be replaced by their hexadecimal values, e.g.,
  30       <code class="literal">red, green, blue, magenta -&gt; FF0000, 00FF00, 0000FF, FF00FF</code>
  31      </p></li><li class="listitem" style="list-style-type: disc"><p>
  32       If indexing numbers, we can
  33       remove some fractional digits to reduce the range of possible
  34       numbers, so for example <span class="emphasis"><em>3.14</em></span>159265359,
  35       <span class="emphasis"><em>3.14</em></span>15926, <span class="emphasis"><em>3.14</em></span> will be the same
  36       after normalization if only two digits are kept after the decimal point.
  37      </p></li></ul></div><p>
  38
  39   </p><p>
  40    A dictionary is a program that accepts a token as
  41    input and returns:
  42    </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
  43       an array of lexemes if the input token is known to the dictionary
  44       (notice that one token can produce more than one lexeme)
  45      </p></li><li class="listitem" style="list-style-type: disc"><p>
  46       a single lexeme with the <code class="literal">TSL_FILTER</code> flag set, to replace
  47       the original token with a new token to be passed to subsequent
  48       dictionaries (a dictionary that does this is called a
  49       <em class="firstterm">filtering dictionary</em>)
  50      </p></li><li class="listitem" style="list-style-type: disc"><p>
  51       an empty array if the dictionary knows the token, but it is a stop word
  52      </p></li><li class="listitem" style="list-style-type: disc"><p>
  53       <code class="literal">NULL</code> if the dictionary does not recognize the input token
  54      </p></li></ul></div><p>
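    The first and third outcomes can be observed with
    <code class="function">ts_lexize</code>.  The following is a minimal sketch
    that assumes a stock installation in which the built-in
    <code class="literal">english_stem</code> dictionary uses the default English
    stop-word list:
</p><pre class="programlisting">
-- A known word: the dictionary returns an array of lexemes.
SELECT ts_lexize('english_stem', 'stars');

-- A stop word: the dictionary returns an empty array.
SELECT ts_lexize('english_stem', 'a');
</pre><p>
    A <code class="literal">NULL</code> result cannot be demonstrated with
    <code class="literal">english_stem</code>, because a Snowball stemmer
    recognizes every token.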
  55   </p><p>
  56    <span class="productname">PostgreSQL</span> provides predefined dictionaries for
  57    many languages.  There are also several predefined templates that can be
  58    used to create new dictionaries with custom parameters.  Each predefined
  59    dictionary template is described below.  If no existing
  60    template is suitable, it is possible to create new ones; see the
  61    <code class="filename">contrib/</code> area of the <span class="productname">PostgreSQL</span> distribution
  62    for examples.
  63   </p><p>
  64    A text search configuration binds a parser together with a set of
  65    dictionaries to process the parser's output tokens.  For each token
  66    type that the parser can return, a separate list of dictionaries is
  67    specified by the configuration.  When a token of that type is found
  68    by the parser, each dictionary in the list is consulted in turn,
  69    until some dictionary recognizes it as a known word.  If it is identified
  70    as a stop word, or if no dictionary recognizes the token, it will be
  71    discarded and not indexed or searched for.
  72    Normally, the first dictionary that returns a non-<code class="literal">NULL</code>
  73    output determines the result, and any remaining dictionaries are not
  74    consulted; but a filtering dictionary can replace the given word
  75    with a modified word, which is then passed to subsequent dictionaries.
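</p><p>
    To see which dictionary in the list handled each token, one option is
    <code class="function">ts_debug</code>.  The sketch below assumes the
    built-in <code class="literal">english</code> configuration; the sample
    text is arbitrary:
</p><pre class="programlisting">
-- "dictionaries" is the list consulted for the token type;
-- "dictionary" is the one that recognized the token (NULL if none did);
-- an empty "lexemes" array marks a stop word.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('english', 'The stars');
</pre><p>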
  76   </p><p>
  77    The general rule for configuring a list of dictionaries
  78    is to place first the most narrow, most specific dictionary, then the more
  79    general dictionaries, finishing with a very general dictionary, like
  80    a <span class="application">Snowball</span> stemmer or <code class="literal">simple</code>, which
  81    recognizes everything.  For example, for an astronomy-specific search
  82    (<code class="literal">astro_en</code> configuration) one could bind token type
  83    <code class="type">asciiword</code> (ASCII word) to a synonym dictionary of astronomical
  84    terms, a general English dictionary and a <span class="application">Snowball</span> English
  85    stemmer:
  86
  87 </p><pre class="programlisting">
  88 ALTER TEXT SEARCH CONFIGURATION astro_en
  89     ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
  90 </pre><p>
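    The statement above presumes that the configuration and its dictionaries
    already exist.  One possible setup is sketched below.  The file base names
    <code class="literal">astrosyn</code> and <code class="literal">en_us</code>
    are assumptions, and the corresponding files would have to be installed
    under <code class="filename">$SHAREDIR/tsearch_data</code> first.  Because a
    configuration copied from <code class="literal">english</code> already maps
    <code class="type">asciiword</code>, and <code class="literal">ADD MAPPING</code>
    reports an error for a token type that is already mapped, the sketch uses
    <code class="literal">ALTER MAPPING</code> instead:
</p><pre class="programlisting">
CREATE TEXT SEARCH DICTIONARY astrosyn (
    TEMPLATE = synonym,
    SYNONYMS = astrosyn          -- assumed file: astrosyn.syn
);

CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = en_us,            -- assumed file: en_us.dict
    AffFile = en_us,             -- assumed file: en_us.affix
    StopWords = english
);

CREATE TEXT SEARCH CONFIGURATION astro_en ( COPY = pg_catalog.english );

ALTER TEXT SEARCH CONFIGURATION astro_en
    ALTER MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
</pre><p>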
  91   </p><p>
  92    A filtering dictionary can be placed anywhere in the list, except at the
  93    end, where it would be useless.  Filtering dictionaries are useful to partially
  94    normalize words to simplify the task of later dictionaries.  For example,
  95    a filtering dictionary could be used to remove accents from accented
  96    letters, as is done by the <a class="xref" href="unaccent.html" title="F.48. unaccent — a text search dictionary which removes diacritics">unaccent</a> module.
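</p><p>
    As a sketch of how such a filtering dictionary might be placed ahead of a
    stemmer, the following assumes that the
    <code class="literal">unaccent</code> extension is available for
    installation; the configuration name <code class="literal">fr</code> is
    arbitrary:
</p><pre class="programlisting">
CREATE EXTENSION IF NOT EXISTS unaccent;

CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );

-- unaccent rewrites the token and passes it on to french_stem.
ALTER TEXT SEARCH CONFIGURATION fr
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, french_stem;

SELECT to_tsvector('fr', 'Hôtels de la Mer');
</pre><p>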
  97   </p><div class="sect2" id="TEXTSEARCH-STOPWORDS"><div class="titlepage"><div><div><h3 class="title">12.6.1. Stop Words <a href="#TEXTSEARCH-STOPWORDS" class="id_link">#</a></h3></div></div></div><p>
  98     Stop words are words that are very common, appear in almost every
  99     document, and have no discrimination value. Therefore, they can be ignored
 100     in the context of full text searching. For example, every English text
 101     contains words like <code class="literal">a</code> and <code class="literal">the</code>, so it is
 102     useless to store them in an index.  However, stop words do affect the
 103     positions in <code class="type">tsvector</code>, which in turn affect ranking:
 104
 105 </p><pre class="screen">
 106 SELECT to_tsvector('english', 'in the list of stop words');
 107         to_tsvector
 108 ----------------------------
 109  'list':3 'stop':5 'word':6
 110 </pre><p>
 111
 112     Positions 1, 2, and 4 are missing because they were occupied by stop words.  Ranks
 113     calculated for documents with and without stop words are quite different:
 114
 115 </p><pre class="screen">
 116 SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list &amp; stop'));
 117  ts_rank_cd
 118 ------------
 119        0.05
 120
 121 SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list &amp; stop'));
 122  ts_rank_cd
 123 ------------
 124         0.1
 125 </pre><p>
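    For comparison, the same text can be processed with a configuration that
    has no stop-word list.  The sketch below assumes a stock installation, in
    which the built-in <code class="literal">simple</code> configuration only
    folds words to lower case, so every word keeps its position:
</p><pre class="programlisting">
-- No stop words are removed, so the positions have no gaps.
SELECT to_tsvector('simple', 'in the list of stop words');
</pre><p>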
 126
 127    </p><p>
 128     It is up to the specific dictionary how it treats stop words. For example,
 129     <code class="literal">ispell</code> dictionaries first normalize words and then
 130     look at the list of stop words, while <code class="literal">Snowball</code> stemmers
 131     first check the list of stop words. The reason for the different
 132     behavior is an attempt to decrease noise.
 133    </p></div><div class="sect2" id="TEXTSEARCH-SIMPLE-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.2. Simple Dictionary <a href="#TEXTSEARCH-SIMPLE-DICTIONARY" class="id_link">#</a></h3></div></div></div><p>
 134     The <code class="literal">simple</code> dictionary template operates by converting the
 135     input token to lower case and checking it against a file of stop words.
 136     If it is found in the file then an empty array is returned, causing
 137     the token to be discarded.  If not, the lower-cased form of the word
 138     is returned as the normalized lexeme.  Alternatively, the dictionary
 139     can be configured to report non-stop-words as unrecognized, allowing
 140     them to be passed on to the next dictionary in the list.
 141    </p><p>
 142     Here is an example of a dictionary definition using the <code class="literal">simple</code>
 143     template:
 144
 145 </p><pre class="programlisting">
 146 CREATE TEXT SEARCH DICTIONARY public.simple_dict (
 147     TEMPLATE = pg_catalog.simple,
 148     STOPWORDS = english
 149 );
 150 </pre><p>
 151
 152     Here, <code class="literal">english</code> is the base name of a file of stop words.
 153     The file's full name will be
 154     <code class="filename">$SHAREDIR/tsearch_data/english.stop</code>,
 155     where <code class="literal">$SHAREDIR</code> means the
 156     <span class="productname">PostgreSQL</span> installation's shared-data directory,
 157     often <code class="filename">/usr/local/share/postgresql</code> (use <code class="command">pg_config
 158     --sharedir</code> to determine it if you're not sure).
 159     The file format is simply a list
 160     of words, one per line.  Blank lines and trailing spaces are ignored,
 161     and upper case is folded to lower case, but no other processing is done
 162     on the file contents.
 163    </p><p>
 164     Now we can test our dictionary:
 165
 166 </p><pre class="screen">
 167 SELECT ts_lexize('public.simple_dict', 'YeS');
 168  ts_lexize
 169 -----------
 170  {yes}
 171
 172 SELECT ts_lexize('public.simple_dict', 'The');
 173  ts_lexize
 174 -----------
 175  {}
 176 </pre><p>
 177    </p><p>
 178     We can also choose to return <code class="literal">NULL</code>, instead of the lower-cased
 179     word, if it is not found in the stop words file.  This behavior is
 180     selected by setting the dictionary's <code class="literal">Accept</code> parameter to
 181     <code class="literal">false</code>.  Continuing the example:
 182
 183 </p><pre class="screen">
 184 ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );
 185
 186 SELECT ts_lexize('public.simple_dict', 'YeS');
 187  ts_lexize
 188 -----------
 189
 190
 191 SELECT ts_lexize('public.simple_dict', 'The');
 192  ts_lexize
 193 -----------
 194  {}
 195 </pre><p>
 196    </p><p>
 197     With the default setting of <code class="literal">Accept</code> = <code class="literal">true</code>,
 198     it is only useful to place a <code class="literal">simple</code> dictionary at the end
 199     of a list of dictionaries, since it will never pass on any token to
 200     a following dictionary.  Conversely, <code class="literal">Accept</code> = <code class="literal">false</code>
 201     is only useful when there is at least one following dictionary.
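</p><p>
    A minimal sketch of the second arrangement follows.  It assumes that
    <code class="literal">public.simple_dict</code> has been altered to
    <code class="literal">Accept = false</code> as shown above;
    <code class="literal">my_english</code> is a hypothetical configuration name:
</p><pre class="programlisting">
CREATE TEXT SEARCH CONFIGURATION public.my_english ( COPY = pg_catalog.english );

-- simple_dict discards its stop words and reports everything else as
-- unrecognized, so those tokens continue on to english_stem.
ALTER TEXT SEARCH CONFIGURATION public.my_english
    ALTER MAPPING FOR asciiword
    WITH public.simple_dict, english_stem;
</pre><p>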
 202    </p><div class="caution"><h3 class="title">Caution</h3><p>
 203      Most types of dictionaries rely on configuration files, such as files of
 204      stop words.  These files <span class="emphasis"><em>must</em></span> be stored in UTF-8 encoding.
 205      They will be translated to the actual database encoding, if that is
 206      different, when they are read into the server.
 207     </p></div><div class="caution"><h3 class="title">Caution</h3><p>
 208      Normally, a database session will read a dictionary configuration file
 209      only once, when it is first used within the session.  If you modify a
 210      configuration file and want to force existing sessions to pick up the
 211      new contents, issue an <code class="command">ALTER TEXT SEARCH DICTIONARY</code> command
 212      on the dictionary.  This can be a <span class="quote">“<span class="quote">dummy</span>”</span> update that doesn't
 213      actually change any parameter values.
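</p><p>
     One form of such a dummy update, applied here to the
     <code class="literal">public.simple_dict</code> example dictionary, is
     sketched below.  <code class="literal">dummy</code> is not a real option;
     naming an option without a value removes it, and removing an option that
     does not exist is not an error:
</p><pre class="programlisting">
-- Changes nothing, but forces sessions to re-read the configuration files.
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( dummy );
</pre><p>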
 214     </p></div></div><div class="sect2" id="TEXTSEARCH-SYNONYM-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.3. Synonym Dictionary <a href="#TEXTSEARCH-SYNONYM-DICTIONARY" class="id_link">#</a></h3></div></div></div><p>
 215     This dictionary template is used to create dictionaries that replace a
 216     word with a synonym. Phrases are not supported (use the thesaurus
 217     template (<a class="xref" href="textsearch-dictionaries.html#TEXTSEARCH-THESAURUS" title="12.6.4. Thesaurus Dictionary">Section 12.6.4</a>) for that).  A synonym
 218     dictionary can be used to overcome linguistic problems, for example, to
 219     prevent an English stemmer dictionary from reducing the word <span class="quote">“<span class="quote">Paris</span>”</span> to
 220     <span class="quote">“<span class="quote">pari</span>”</span>.  It is enough to have a <code class="literal">Paris paris</code> line in the
 221     synonym dictionary and put it before the <code class="literal">english_stem</code>
 222     dictionary.  For example:
 223
 224 </p><pre class="screen">
 225 SELECT * FROM ts_debug('english', 'Paris');
 226    alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
 227 -----------+-----------------+-------+----------------+--------------+---------
 228  asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari}
 229
 230 CREATE TEXT SEARCH DICTIONARY my_synonym (
 231     TEMPLATE = synonym,
 232     SYNONYMS = my_synonyms
 233 );
 234
 235 ALTER TEXT SEARCH CONFIGURATION english
 236     ALTER MAPPING FOR asciiword
 237     WITH my_synonym, english_stem;
 238
 239 SELECT * FROM ts_debug('english', 'Paris');
 240    alias   |   description   | token |       dictionaries        | dictionary | lexemes
 241 -----------+-----------------+-------+---------------------------+------------+---------
 242  asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
 243 </pre><p>
 244    </p><p>
 245     The only parameter required by the <code class="literal">synonym</code> template is
 246     <code class="literal">SYNONYMS</code>, which is the base name of its configuration file
 247     — <code class="literal">my_synonyms</code> in the above example.
 248     The file's full name will be
 249     <code class="filename">$SHAREDIR/tsearch_data/my_synonyms.syn</code>
 250     (where <code class="literal">$SHAREDIR</code> means the
 251     <span class="productname">PostgreSQL</span> installation's shared-data directory).
 252     The file format is just one line
 253     per word to be substituted, with the word followed by its synonym,
 254     separated by white space.  Blank lines and trailing spaces are ignored.
 255    </p><p>
 256     The <code class="literal">synonym</code> template also has an optional parameter
 257     <code class="literal">CaseSensitive</code>, which defaults to <code class="literal">false</code>.  When
 258     <code class="literal">CaseSensitive</code> is <code class="literal">false</code>, words in the synonym file
 259     are folded to lower case, as are input tokens.  When it is
 260     <code class="literal">true</code>, words and tokens are not folded to lower case,
 261     but are compared as-is.
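</p><p>
    As a sketch, a case-sensitive variant of the earlier example could be
    defined as follows; the dictionary name
    <code class="literal">my_synonym_cs</code> is hypothetical:
</p><pre class="programlisting">
CREATE TEXT SEARCH DICTIONARY my_synonym_cs (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms,
    CaseSensitive = true
);
</pre><p>
    With a <code class="literal">Paris paris</code> line in the file, this
    dictionary would recognize the token <code class="literal">Paris</code>
    but not <code class="literal">paris</code> or <code class="literal">PARIS</code>,
    which would be passed on to the next dictionary in the list.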
 262    </p><p>
 263     An asterisk (<code class="literal">*</code>) can be placed at the end of a synonym
 264     in the configuration file.  This indicates that the synonym is a prefix.
 265     The asterisk is ignored when the entry is used in
 266     <code class="function">to_tsvector()</code>, but when it is used in
 267     <code class="function">to_tsquery()</code>, the result will be a query item with
 268     the prefix match marker (see
 269     <a class="xref" href="textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES" title="12.3.2. Parsing Queries">Section 12.3.2</a>).
 270     For example, suppose we have these entries in
 271     <code class="filename">$SHAREDIR/tsearch_data/synonym_sample.syn</code>:
 272 </p><pre class="programlisting">
 273 postgres        pgsql
 274 postgresql      pgsql
 275 postgre pgsql
 276 gogle   googl
 277 indices index*
 278 </pre><p>
 279     Then we will get these results:
 280 </p><pre class="screen">
 281 mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
 282 mydb=# SELECT ts_lexize('syn', 'indices');
 283  ts_lexize
 284 -----------
 285  {index}
 286 (1 row)
 287
 288 mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
 289 mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
 290 mydb=# SELECT to_tsvector('tst', 'indices');
 291  to_tsvector
 292 -------------
 293  'index':1
 294 (1 row)
 295
 296 mydb=# SELECT to_tsquery('tst', 'indices');
 297  to_tsquery
 298 ------------
 299  'index':*
 300 (1 row)
 301
 302 mydb=# SELECT 'indexes are very useful'::tsvector;
 303             tsvector
 304 ---------------------------------
 305  'are' 'indexes' 'useful' 'very'
 306 (1 row)
 307
 308 mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst', 'indices');
 309  ?column?
 310 ----------
 311  t
 312 (1 row)
 313 </pre><p>
 314    </p></div><div class="sect2" id="TEXTSEARCH-THESAURUS"><div class="titlepage"><div><div><h3 class="title">12.6.4. Thesaurus Dictionary <a href="#TEXTSEARCH-THESAURUS" class="id_link">#</a></h3></div></div></div><p>
 315     A thesaurus dictionary (sometimes abbreviated as <acronym class="acronym">TZ</acronym>) is
 316     a collection of words that includes information about the relationships
 317     of words and phrases, i.e., broader terms (<acronym class="acronym">BT</acronym>), narrower
 318     terms (<acronym class="acronym">NT</acronym>), preferred terms, non-preferred terms, related
 319     terms, etc.
 320    </p><p>
 321     Basically, a thesaurus dictionary replaces all non-preferred terms with one
 322     preferred term and, optionally, preserves the original terms for indexing
 323     as well.  <span class="productname">PostgreSQL</span>'s current implementation of the
 324     thesaurus dictionary is an extension of the synonym dictionary with added
 325     <em class="firstterm">phrase</em> support.  A thesaurus dictionary requires
 326     a configuration file of the following format:
 327
 328 </p><pre class="programlisting">
 329 # this is a comment
 330 sample word(s) : indexed word(s)
 331 more sample word(s) : more indexed word(s)
 332 ...
 333 </pre><p>
 334
 335     where the colon (<code class="symbol">:</code>) symbol acts as a delimiter between a
 336     phrase and its replacement.
 337    </p><p>
 338     A thesaurus dictionary uses a <em class="firstterm">subdictionary</em> (which
 339     is specified in the dictionary's configuration) to normalize the input
 340     text before checking for phrase matches. It is only possible to select one
 341     subdictionary.  An error is reported if the subdictionary fails to
 342     recognize a word. In that case, you should remove the use of the word or
 343     teach the subdictionary about it.  You can place an asterisk
 344     (<code class="symbol">*</code>) at the beginning of an indexed word to skip applying
 345     the subdictionary to it, but all sample words <span class="emphasis"><em>must</em></span> be known
 346     to the subdictionary.
 347    </p><p>
 348     The thesaurus dictionary chooses the longest match if there are multiple
 349     phrases matching the input, and ties are broken by using the last
 350     definition.
 351    </p><p>
 352     A sample phrase cannot name a specific stop word recognized by the
 353     subdictionary; instead, use <code class="literal">?</code> to mark the location where any
 354     stop word can appear.  For example, assuming that <code class="literal">a</code> and
 355     <code class="literal">the</code> are stop words according to the subdictionary:
 356
 357 </p><pre class="programlisting">
 358 ? one ? two : swsw
 359 </pre><p>
 360
 361     matches <code class="literal">a one the two</code> and <code class="literal">the one a two</code>;
 362     both would be replaced by <code class="literal">swsw</code>.
 363    </p><p>
 364     Because a thesaurus dictionary can recognize phrases, it must remember
 365     its state and interact with the parser.  It uses the token-type assignments
 366     in the text search configuration to decide whether it should handle the next
 367     word or stop accumulating.  The thesaurus dictionary must therefore be configured
 368     carefully. For example, if the thesaurus dictionary is assigned to handle
 369     only the <code class="literal">asciiword</code> token, then a thesaurus dictionary
 370     definition like <code class="literal">one 7</code> will not work since token type
 371     <code class="literal">uint</code> is not assigned to the thesaurus dictionary.
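</p><p>
    One way to make such a definition work is to assign the thesaurus to both
    token types.  In this sketch, <code class="literal">my_config</code> and
    <code class="literal">my_thesaurus</code> are hypothetical names, and
    <code class="literal">my_config</code> is assumed to be a copy of a built-in
    configuration:
</p><pre class="programlisting">
-- Words: try the thesaurus first, then fall back to the stemmer.
ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR asciiword WITH my_thesaurus, english_stem;

-- Unsigned integers: the thesaurus must see these tokens as well.
ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR uint WITH my_thesaurus, simple;
</pre><p>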
 372    </p><div class="caution"><h3 class="title">Caution</h3><p>
 373      Thesauruses are used during indexing, so any change in the thesaurus
 374      dictionary's parameters <span class="emphasis"><em>requires</em></span> reindexing.
 375      For most other dictionary types, small changes such as adding or
 376      removing stop words do not force reindexing.
 377     </p></div><div class="sect3" id="TEXTSEARCH-THESAURUS-CONFIG"><div class="titlepage"><div><div><h4 class="title">12.6.4.1. Thesaurus Configuration <a href="#TEXTSEARCH-THESAURUS-CONFIG" class="id_link">#</a></h4></div></div></div><p>
 378     To define a new thesaurus dictionary, use the <code class="literal">thesaurus</code>
 379     template.  For example:
 380
 381 </p><pre class="programlisting">
 382 CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
 383     TEMPLATE = thesaurus,
 384     DictFile = mythesaurus,
 385     Dictionary = pg_catalog.english_stem
 386 );
 387 </pre><p>
 388
 389     Here:
 390     </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
 391        <code class="literal">thesaurus_simple</code> is the new dictionary's name
 392       </p></li><li class="listitem" style="list-style-type: disc"><p>
 393        <code class="literal">mythesaurus</code> is the base name of the thesaurus
 394        configuration file.
 395        (Its full name will be <code class="filename">$SHAREDIR/tsearch_data/mythesaurus.ths</code>,
 396        where <code class="literal">$SHAREDIR</code> means the installation shared-data
 397        directory.)
 398       </p></li><li class="listitem" style="list-style-type: disc"><p>
 399        <code class="literal">pg_catalog.english_stem</code> is the subdictionary (here,
 400        a Snowball English stemmer) to use for thesaurus normalization.
 401        Notice that the subdictionary will have its own
 402        configuration (for example, stop words), which is not shown here.
 403       </p></li></ul></div><p>
 404
 405     Now it is possible to bind the thesaurus dictionary <code class="literal">thesaurus_simple</code>
 406     to the desired token types in a configuration, for example:
 407
 408 </p><pre class="programlisting">
 409 ALTER TEXT SEARCH CONFIGURATION russian
 410     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
 411     WITH thesaurus_simple;
 412 </pre><p>
 413    </p></div><div class="sect3" id="TEXTSEARCH-THESAURUS-EXAMPLES"><div class="titlepage"><div><div><h4 class="title">12.6.4.2. Thesaurus Example <a href="#TEXTSEARCH-THESAURUS-EXAMPLES" class="id_link">#</a></h4></div></div></div><p>
 414     Consider a simple astronomical thesaurus <code class="literal">thesaurus_astro</code>,
 415     which contains some astronomical word combinations:
 416
 417 </p><pre class="programlisting">
 418 supernovae stars : sn
 419 crab nebulae : crab
 420 </pre><p>
 421
 422     Below we create a dictionary and bind some token types to
 423     the astronomical thesaurus and an English stemmer:
 424
 425 </p><pre class="programlisting">
 426 CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
 427     TEMPLATE = thesaurus,
 428     DictFile = thesaurus_astro,
 429     Dictionary = english_stem
 430 );
 431
 432 ALTER TEXT SEARCH CONFIGURATION russian
 433     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
 434     WITH thesaurus_astro, english_stem;
 435 </pre><p>
 436
 437     Now we can see how it works.
 438     <code class="function">ts_lexize</code> is not very useful for testing a thesaurus,
 439     because it treats its input as a single token.  Instead we can use
 440     <code class="function">plainto_tsquery</code> and <code class="function">to_tsvector</code>
 441     which will break their input strings into multiple tokens:
 442
 443 </p><pre class="screen">
 444 SELECT plainto_tsquery('supernova star');
 445  plainto_tsquery
 446 -----------------
 447  'sn'
 448
 449 SELECT to_tsvector('supernova star');
 450  to_tsvector
 451 -------------
 452  'sn':1
 453 </pre><p>
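    The calls above omit the configuration argument, so they rely on
    <code class="varname">default_text_search_config</code> naming the
    configuration that was altered.  A sketch that does not depend on that
    setting passes the configuration explicitly:
</p><pre class="programlisting">
SELECT plainto_tsquery('russian', 'supernova star');
SELECT to_tsvector('russian', 'supernova star');

-- Alternatively, set the default for the current session.
SET default_text_search_config = 'pg_catalog.russian';
</pre><p>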
 454
 455     In principle, you can use <code class="function">to_tsquery</code> if you quote
 456     the argument:
 457
 458 </p><pre class="screen">
 459 SELECT to_tsquery('''supernova star''');
 460  to_tsquery
 461 ------------
 462  'sn'
 463 </pre><p>
 464
 465     Notice that <code class="literal">supernova star</code> matches <code class="literal">supernovae
 466     stars</code> in <code class="literal">thesaurus_astro</code> because we specified
 467     the <code class="literal">english_stem</code> stemmer in the thesaurus definition.
 468     The stemmer removed the <code class="literal">e</code> and <code class="literal">s</code>.
 469    </p><p>
 470     To index the original phrase as well as the substitute, just include it
 471     in the right-hand part of the definition:
 472
 473 </p><pre class="screen">
 474 supernovae stars : sn supernovae stars
 475
 476 SELECT plainto_tsquery('supernova star');
 477        plainto_tsquery
 478 -----------------------------
 479  'sn' &amp; 'supernova' &amp; 'star'
 480 </pre><p>
 481    </p></div></div><div class="sect2" id="TEXTSEARCH-ISPELL-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.5. <span class="application">Ispell</span> Dictionary <a href="#TEXTSEARCH-ISPELL-DICTIONARY" class="id_link">#</a></h3></div></div></div><p>
 482     The <span class="application">Ispell</span> dictionary template supports
 483     <em class="firstterm">morphological dictionaries</em>, which can normalize many
 484     different linguistic forms of a word into the same lexeme.  For example,
 485     an English <span class="application">Ispell</span> dictionary can match all declensions and
 486     conjugations of the search term <code class="literal">bank</code>, e.g.,
 487     <code class="literal">banking</code>, <code class="literal">banked</code>, <code class="literal">banks</code>,
 488     <code class="literal">banks'</code>, and <code class="literal">bank's</code>.
 489    </p><p>
 490     The standard <span class="productname">PostgreSQL</span> distribution does
 491     not include any <span class="application">Ispell</span> configuration files.
 492     Dictionaries for a large number of languages are available from <a class="ulink" href="https://www.cs.hmc.edu/~geoff/ispell.html" target="_top">Ispell</a>.
 493     Also, some more modern dictionary file formats are supported — <a class="ulink" href="https://en.wikipedia.org/wiki/MySpell" target="_top">MySpell</a> (OO &lt; 2.0.1)
 494     and <a class="ulink" href="https://hunspell.github.io/" target="_top">Hunspell</a>
 495     (OO &gt;= 2.0.2).  A large list of dictionaries is available on the <a class="ulink" href="https://wiki.openoffice.org/wiki/Dictionaries" target="_top">OpenOffice
 496     Wiki</a>.
 497    </p><p>
 498     To create an <span class="application">Ispell</span> dictionary perform these steps:
 499    </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
 500       download dictionary configuration files. <span class="productname">OpenOffice</span>
 501       extension files have the <code class="filename">.oxt</code> extension. Extract
 502       the <code class="filename">.aff</code> and <code class="filename">.dic</code> files from the archive and change
 503       their extensions to <code class="filename">.affix</code> and <code class="filename">.dict</code>. Some
 504       dictionary files must also be converted to UTF-8 encoding, for example
 505       with these commands for a Norwegian language dictionary:
 506 </p><pre class="programlisting">
 507 iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
 508 iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
 509 </pre><p>
 510      </p></li><li class="listitem" style="list-style-type: disc"><p>
 511       copy files to the <code class="filename">$SHAREDIR/tsearch_data</code> directory
 512      </p></li><li class="listitem" style="list-style-type: disc"><p>
 513       load files into PostgreSQL with the following command:
 514 </p><pre class="programlisting">
 515 CREATE TEXT SEARCH DICTIONARY english_hunspell (
 516     TEMPLATE = ispell,
 517     DictFile = en_us,
 518     AffFile = en_us,
 519     StopWords = english);
 520 </pre><p>
 521      </p></li></ul></div><p>
 522     Here, <code class="literal">DictFile</code>, <code class="literal">AffFile</code>, and <code class="literal">StopWords</code>
 523     specify the base names of the dictionary, affixes, and stop-words files.
 524     The stop-words file has the same format explained above for the
 525     <code class="literal">simple</code> dictionary type.  The format of the other files is
 526     not specified here but is available from the above-mentioned web sites.
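   </p><p>
    As a quick check that the files were loaded, the new dictionary can be
    queried directly with <code class="function">ts_lexize</code>. This is only a sketch: it
    assumes the <code class="literal">english_hunspell</code> dictionary created above, and
    no output is shown because the results depend on the contents of the
    installed <code class="filename">en_us</code> files.
</p><pre class="programlisting">
-- Should return the normalized lexeme(s) if the dictionary knows the word.
SELECT ts_lexize('english_hunspell', 'banking');

-- A recognized word that normalizes to an entry in the english stop-word
-- list is expected to yield an empty array.
SELECT ts_lexize('english_hunspell', 'the');

-- A string the dictionary does not recognize is expected to yield NULL
-- (this assumes 'qwzxv' is absent from the dictionary file).
SELECT ts_lexize('english_hunspell', 'qwzxv');
</pre><p>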
 527    </p><p>
 528     Ispell dictionaries usually recognize a limited set of words, so they
 529     should be followed by another broader dictionary; for
 530     example, a Snowball dictionary, which recognizes everything.
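   </p><p>
    A minimal sketch of such a chain follows. It assumes the
    <code class="literal">english_hunspell</code> dictionary created above and the built-in
    <code class="literal">english_stem</code> Snowball dictionary;
    <code class="literal">english_ispell_cfg</code> is a hypothetical configuration name, and
    the list of token types is only one reasonable choice.
</p><pre class="programlisting">
-- english_ispell_cfg is a hypothetical name used for illustration.
CREATE TEXT SEARCH CONFIGURATION english_ispell_cfg (
    COPY = pg_catalog.english
);

-- Dictionaries are consulted left to right: tokens that the Ispell
-- dictionary does not recognize fall through to the Snowball stemmer.
ALTER TEXT SEARCH CONFIGURATION english_ispell_cfg
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH english_hunspell, english_stem;
</pre><p>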
 531    </p><p>
 532     The <code class="filename">.affix</code> file of <span class="application">Ispell</span> has the following
 533     structure:
 534 </p><pre class="programlisting">
 535 prefixes
 536 flag *A:
 537     .           &gt;   RE      # As in enter &gt; reenter
 538 suffixes
 539 flag T:
 540     E           &gt;   ST      # As in late &gt; latest
 541     [^AEIOU]Y   &gt;   -Y,IEST # As in dirty &gt; dirtiest
 542     [AEIOU]Y    &gt;   EST     # As in gray &gt; grayest
 543     [^EY]       &gt;   EST     # As in small &gt; smallest
 544 </pre><p>
 545    </p><p>
 546     And the <code class="filename">.dict</code> file has the following structure:
 547 </p><pre class="programlisting">
 548 lapse/ADGRS
 549 lard/DGRS
 550 large/PRTY
 551 lark/MRS
 552 </pre><p>
 553    </p><p>
 554     The format of the <code class="filename">.dict</code> file is:
 555 </p><pre class="programlisting">
 556 basic_form/affix_class_name
 557 </pre><p>
 558    </p><p>
 559     In the <code class="filename">.affix</code> file every affix flag is described in the
 560     following format:
 561 </p><pre class="programlisting">
 562 condition &gt; [-stripping_letters,] adding_affix
 563 </pre><p>
 564    </p><p>
 565     Here, condition has a format similar to the format of regular expressions.
 566     It can use groupings <code class="literal">[...]</code> and <code class="literal">[^...]</code>.
 567     For example, <code class="literal">[AEIOU]Y</code> means that the last letter of the word
 568     is <code class="literal">"y"</code> and the penultimate letter is <code class="literal">"a"</code>,
 569     <code class="literal">"e"</code>, <code class="literal">"i"</code>, <code class="literal">"o"</code> or <code class="literal">"u"</code>.
 570     <code class="literal">[^EY]</code> means that the last letter is neither <code class="literal">"e"</code>
 571     nor <code class="literal">"y"</code>.
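   </p><p>
    To make the conditions concrete, the four suffix rules of flag
    <code class="literal">T</code> shown above can be imitated in plain SQL with regular
    expressions. This is an illustration only, not how the
    <code class="literal">ispell</code> template is implemented: it applies the rules in
    the forward direction (base form to derived form), whereas the dictionary
    uses them to reduce a derived form back to its base form.
</p><pre class="programlisting">
SELECT word,
       CASE
           -- E           &gt;   ST
           WHEN word ~* 'e$'         THEN word || 'st'
           -- [^AEIOU]Y   &gt;   -Y,IEST  (strip "y", then add "iest")
           WHEN word ~* '[^aeiou]y$' THEN regexp_replace(word, 'y$', 'iest', 'i')
           -- [AEIOU]Y    &gt;   EST
           WHEN word ~* '[aeiou]y$'  THEN word || 'est'
           -- [^EY]       &gt;   EST
           WHEN word ~* '[^ey]$'     THEN word || 'est'
       END AS superlative
FROM (VALUES ('late'), ('dirty'), ('gray'), ('small')) AS t(word);
</pre><p>
    For the four input words the query produces <code class="literal">latest</code>,
    <code class="literal">dirtiest</code>, <code class="literal">grayest</code>, and
    <code class="literal">smallest</code>, matching the comments in the affix file.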
 572    </p><p>
 573     Ispell dictionaries support splitting compound words,
 574     which is a useful feature.
 575     Note that the affix file should specify a special flag using the
 576     <code class="literal">compoundwords controlled</code> statement that marks dictionary
 577     words that can participate in compound formation:
 578
 579 </p><pre class="programlisting">
 580 compoundwords  controlled z
 581 </pre><p>
 582
 583     Here are some examples for the Norwegian language:
 584
 585 </p><pre class="programlisting">
 586 SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
 587    {over,buljong,terning,pakk,mester,assistent}
 588 SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
 589    {sjokoladefabrikk,sjokolade,fabrikk}
 590 </pre><p>
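    The <code class="literal">norwegian_ispell</code> dictionary used in these examples is
    not built in. One possible definition is sketched here, assuming that the
    <code class="filename">nn_no.dict</code> and <code class="filename">nn_no.affix</code> files
    produced by the <code class="command">iconv</code> commands above have been copied to
    <code class="filename">$SHAREDIR/tsearch_data</code>, that the affix file contains a
    <code class="literal">compoundwords controlled</code> statement, and that the
    <code class="literal">norwegian</code> stop-word list shipped with
    <span class="productname">PostgreSQL</span> is wanted:
</p><pre class="programlisting">
CREATE TEXT SEARCH DICTIONARY norwegian_ispell (
    TEMPLATE = ispell,
    DictFile = nn_no,       -- reads nn_no.dict
    AffFile = nn_no,        -- reads nn_no.affix
    StopWords = norwegian   -- reads norwegian.stop
);
</pre><p>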
 591    </p><p>
 592     <span class="application">MySpell</span> format is a subset of <span class="application">Hunspell</span>.
 593     The <code class="filename">.affix</code> file of <span class="application">Hunspell</span> has the following
 594     structure:
 595 </p><pre class="programlisting">
 596 PFX A Y 1
 597 PFX A   0     re         .
 598 SFX T N 4
 599 SFX T   0     st         e
 600 SFX T   y     iest       [^aeiou]y
 601 SFX T   0     est        [aeiou]y
 602 SFX T   0     est        [^ey]
 603 </pre><p>
 604    </p><p>
 605     The first line of an affix class is the header. The fields of each affix rule are
 606     listed after the header:
 607    </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
 608       parameter name (PFX or SFX)
 609      </p></li><li class="listitem" style="list-style-type: disc"><p>
 610       flag (name of the affix class)
 611      </p></li><li class="listitem" style="list-style-type: disc"><p>
 612       characters to strip from the beginning (for a prefix) or end (for a suffix) of the
 613       word
 614      </p></li><li class="listitem" style="list-style-type: disc"><p>
 615       affix to add
 616      </p></li><li class="listitem" style="list-style-type: disc"><p>
 617       condition that has a format similar to the format of regular expressions.
 618      </p></li></ul></div><p>
 619     The <code class="filename">.dict</code> file looks like the <code class="filename">.dict</code> file of
 620     <span class="application">Ispell</span>:
 621 </p><pre class="programlisting">
 622 larder/M
 623 lardy/RT
 624 large/RSPMYT
 625 largehearted
 626 </pre><p>
 627    </p><div class="note"><h3 class="title">Note</h3><p>
 628      <span class="application">MySpell</span> does not support compound words.
 629      <span class="application">Hunspell</span> has sophisticated support for compound words. At
 630      present, <span class="productname">PostgreSQL</span> implements only the basic
 631      compound word operations of Hunspell.
 632     </p></div></div><div class="sect2" id="TEXTSEARCH-SNOWBALL-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.6. <span class="application">Snowball</span> Dictionary <a href="#TEXTSEARCH-SNOWBALL-DICTIONARY" class="id_link">#</a></h3></div></div></div><p>
 633     The <span class="application">Snowball</span> dictionary template is based on a project
 634     by Martin Porter, inventor of the popular Porter stemming algorithm
 635     for the English language.  Snowball now provides stemming algorithms for
 636     many languages (see the <a class="ulink" href="https://snowballstem.org/" target="_top">Snowball
 637     site</a> for more information).  Each algorithm understands how to
 638     reduce common variant forms of words to a base, or stem, spelling within
 639     its language.  A Snowball dictionary requires a <code class="literal">language</code>
 640     parameter to identify which stemmer to use, and optionally can specify a
 641     <code class="literal">StopWords</code> file name that gives a list of words to eliminate.
 642     (<span class="productname">PostgreSQL</span>'s standard stopword lists are also
 643     provided by the Snowball project.)
 644     For example, there is a built-in definition equivalent to
 645
 646 </p><pre class="programlisting">
 647 CREATE TEXT SEARCH DICTIONARY english_stem (
 648     TEMPLATE = snowball,
 649     Language = english,
 650     StopWords = english
 651 );
 652 </pre><p>
 653
 654     The stopword file format is the same as already explained.
 655    </p><p>
 656     A <span class="application">Snowball</span> dictionary recognizes everything, whether
 657     or not it is able to simplify the word, so it should be placed
 658     at the end of the dictionary list. It is useless to have it
 659     before any other dictionary because a token will never pass through it to
 660     the next dictionary.
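   </p><p>
    This behavior can be observed with <code class="function">ts_lexize</code> and the
    built-in <code class="literal">english_stem</code> dictionary. The following is a short
    sketch; the input <code class="literal">qwzxv</code> is an arbitrary made-up string, and
    its result is not shown because it depends on the stemmer.
</p><pre class="programlisting">
-- A variant form is reduced to its stem.
SELECT ts_lexize('english_stem', 'stars');
   {star}
-- A stop word yields an empty array and is discarded.
SELECT ts_lexize('english_stem', 'a');
   {}
-- An unknown string is still accepted: the result is a lexeme, not NULL,
-- so a dictionary listed after english_stem would never see the token.
SELECT ts_lexize('english_stem', 'qwzxv');
</pre><p>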
 661    </p></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="textsearch-parsers.html" title="12.5. Parsers">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="textsearch-configuration.html" title="12.7. Configuration Example">Next</a></td></tr><tr><td width="40%" align="left" valign="top">12.5. Parsers </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 12.7. Configuration Example</td></tr></table></div></body></html>