
12.8. Testing and Debugging Text Search

   12.8.1. Configuration Testing
   12.8.2. Parser Testing
   12.8.3. Dictionary Testing

   The behavior of a custom text search configuration can easily become
   confusing. The functions described in this section are useful for
   testing text search objects. You can test a complete configuration, or
   test parsers and dictionaries separately.

12.8.1. Configuration Testing

   The function ts_debug allows easy testing of a text search
   configuration.
ts_debug([ config regconfig, ] document text,
         OUT alias text,
         OUT description text,
         OUT token text,
         OUT dictionaries regdictionary[],
         OUT dictionary regdictionary,
         OUT lexemes text[])
         returns setof record

   ts_debug displays information about every token of document as produced
   by the parser and processed by the configured dictionaries. It uses the
   configuration specified by config, or default_text_search_config if
   that argument is omitted.
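
   As a rough illustration of the optional config argument, the sketch
   below sets default_text_search_config for the session and then calls
   ts_debug with and without an explicit configuration. It assumes the
   built-in english configuration is installed; the sample text is
   arbitrary.

```sql
-- Assumption: the built-in pg_catalog.english configuration is available.
SET default_text_search_config = 'pg_catalog.english';

-- One-argument form: falls back to default_text_search_config.
SELECT alias, token, lexemes FROM ts_debug('fat rats');

-- Two-argument form: names the configuration explicitly.
SELECT alias, token, lexemes FROM ts_debug('english', 'fat rats');
```

   With the setting above, both queries should return the same rows.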

   ts_debug returns one row for each token identified in the text by the
   parser. The columns returned are
     * alias text: short name of the token type
     * description text: description of the token type
     * token text: text of the token
     * dictionaries regdictionary[]: the dictionaries selected by the
       configuration for this token type
     * dictionary regdictionary: the dictionary that recognized the
       token, or NULL if none did
     * lexemes text[]: the lexeme(s) produced by the dictionary that
       recognized the token, or NULL if none did; an empty array ({})
       means it was recognized as a stop word

   Here is a simple example:
SELECT * FROM ts_debug('english', 'a fat  cat sat on a mat - it ate a fat rats');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | cat   | {english_stem} | english_stem | {cat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | sat   | {english_stem} | english_stem | {sat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | on    | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | mat   | {english_stem} | english_stem | {mat}
 blank     | Space symbols   |       | {}             |              |
 blank     | Space symbols   | -     | {}             |              |
 asciiword | Word, all ASCII | it    | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | ate   | {english_stem} | english_stem | {ate}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | rats  | {english_stem} | english_stem | {rat}
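
   The dictionary and lexemes columns can be read mechanically. The
   following sketch labels each non-blank token from the same sample text
   according to the column rules listed earlier; the outcome labels are
   illustrative wording, not something ts_debug itself produces.

```sql
-- Classify each non-blank token using the documented meaning of the
-- dictionary and lexemes columns. The label strings are made up for
-- this example.
SELECT token,
       CASE
           WHEN dictionary IS NULL       THEN 'dropped: no dictionary recognized it'
           WHEN cardinality(lexemes) = 0 THEN 'dropped: stop word'
           ELSE 'indexed as ' || array_to_string(lexemes, ', ')
       END AS outcome
FROM ts_debug('english', 'a fat  cat sat on a mat - it ate a fat rats')
WHERE alias <> 'blank';
```

   For this input the first branch never fires, because english_stem
   recognizes every word; it matters for configurations whose dictionary
   list can leave a token unrecognized.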

   For a more extensive demonstration, we first create a public.english
   configuration and Ispell dictionary for the English language:
CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );

CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);

ALTER TEXT SEARCH CONFIGURATION public.english
   ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;

SELECT * FROM ts_debug('public.english', 'The Brightest supernovaes');
   alias   |   description   |    token    |         dictionaries          |   dictionary   |   lexemes
-----------+-----------------+-------------+-------------------------------+----------------+-------------
 asciiword | Word, all ASCII | The         | {english_ispell,english_stem} | english_ispell | {}
 blank     | Space symbols   |             | {}                            |                |
 asciiword | Word, all ASCII | Brightest   | {english_ispell,english_stem} | english_ispell | {bright}
 blank     | Space symbols   |             | {}                            |                |
 asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem   | {supernova}

   In this example, the word Brightest was recognized by the parser as an
   ASCII word (alias asciiword). For this token type the dictionary list
   is english_ispell and english_stem. The word was recognized by
   english_ispell, which reduced it to its base form bright. The word
   supernovaes is unknown to the english_ispell dictionary, so it was
   passed to the next dictionary, english_stem, which recognized it. (In
   fact, english_stem is a Snowball dictionary, which recognizes
   everything; that is why it was placed at the end of the dictionary
   list.)

   The word The was recognized by the english_ispell dictionary as a stop
   word (Section 12.6.1) and will not be indexed. The spaces are discarded
   too, since the configuration provides no dictionaries at all for them.
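
   One way to confirm that stop words and blanks really are left out of
   the index is to run the same text through to_tsvector. The sketch
   below uses the built-in english configuration and the sample sentence
   from the first ts_debug example, since public.english depends on
   Ispell files that may not be installed.

```sql
-- Stop words (a, on, it) and blank tokens produce no entries, but stop
-- words still consume a position, so fat lands at 2 rather than 1.
SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
--                      to_tsvector
-- -----------------------------------------------------
--  'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
```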

   You can reduce the width of the output by explicitly specifying which
   columns you want to see:
SELECT alias, token, dictionary, lexemes
FROM ts_debug('public.english', 'The Brightest supernovaes');
   alias   |    token    |   dictionary   |   lexemes
-----------+-------------+----------------+-------------
 asciiword | The         | english_ispell | {}
 blank     |             |                |
 asciiword | Brightest   | english_ispell | {bright}
 blank     |             |                |
 asciiword | supernovaes | english_stem   | {supernova}

12.8.2. Parser Testing

   The following functions allow direct testing of a text search parser.
ts_parse(parser_name text, document text,
         OUT tokid integer, OUT token text) returns setof record
ts_parse(parser_oid oid, document text,
         OUT tokid integer, OUT token text) returns setof record

   ts_parse parses the given document and returns a series of records, one
   for each token produced by parsing. Each record includes a tokid
   showing the assigned token type and a token which is the text of the
   token. For example:
SELECT * FROM ts_parse('default', '123 - a number');
 tokid | token
-------+--------
    22 | 123
    12 |
    12 | -
     1 | a
    12 |
     1 | number

ts_token_type(parser_name text, OUT tokid integer,
              OUT alias text, OUT description text) returns setof record
ts_token_type(parser_oid oid, OUT tokid integer,
              OUT alias text, OUT description text) returns setof record

   ts_token_type returns a table which describes each type of token the
   specified parser can recognize. For each token type, the table gives
   the integer tokid that the parser uses to label a token of that type,
   the alias that names the token type in configuration commands, and a
   short description. For example:
SELECT * FROM ts_token_type('default');
 tokid |      alias      |               description
-------+-----------------+------------------------------------------
     1 | asciiword       | Word, all ASCII
     2 | word            | Word, all letters
     3 | numword         | Word, letters and digits
     4 | email           | Email address
     5 | url             | URL
     6 | host            | Host
     7 | sfloat          | Scientific notation
     8 | version         | Version number
     9 | hword_numpart   | Hyphenated word part, letters and digits
    10 | hword_part      | Hyphenated word part, all letters
    11 | hword_asciipart | Hyphenated word part, all ASCII
    12 | blank           | Space symbols
    13 | tag             | XML tag
    14 | protocol        | Protocol head
    15 | numhword        | Hyphenated word, letters and digits
    16 | asciihword      | Hyphenated word, all ASCII
    17 | hword           | Hyphenated word, all letters
    18 | url_path        | URL path
    19 | file            | File or path name
    20 | float           | Decimal notation
    21 | int             | Signed integer
    22 | uint            | Unsigned integer
    23 | entity          | XML entity
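
   Because ts_parse reports only numeric token types, it can be handy to
   join its output to ts_token_type so that each token carries its alias.
   This is a minimal sketch; the aliases p and t and the ord column are
   names chosen for this example.

```sql
-- WITH ORDINALITY adds a running number so the original token order can
-- be restored after the join.
SELECT p.ord, tokid, t.alias, p.token
FROM ts_parse('default', '123 - a number')
         WITH ORDINALITY AS p(tokid, token, ord)
JOIN ts_token_type('default') AS t USING (tokid)
ORDER BY p.ord;
```

   The first row should show the token 123 with alias uint, matching
   tokid 22 in the table above.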

12.8.3. Dictionary Testing

   The ts_lexize function facilitates dictionary testing.
ts_lexize(dict regdictionary, token text) returns text[]

   ts_lexize returns an array of lexemes if the input token is known to
   the dictionary, or an empty array if the token is known to the
   dictionary but it is a stop word, or NULL if it is an unknown word.

   Examples:
SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}

SELECT ts_lexize('english_stem', 'a');
 ts_lexize
-----------
 {}
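
   The two examples above never return NULL, because english_stem
   recognizes every word. To see the unknown-word case as well, one
   option is a dictionary built on the simple template with Accept set
   to false, which reports non-stop words as unrecognized. The name
   public.demo_simple is hypothetical, and the sketch assumes the
   english stop word file that ships with PostgreSQL is installed.

```sql
-- Hypothetical dictionary, created only for this demonstration.
CREATE TEXT SEARCH DICTIONARY public.demo_simple (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);

-- With Accept = false, words that are not stop words are left unrecognized.
ALTER TEXT SEARCH DICTIONARY public.demo_simple ( Accept = false );

SELECT ts_lexize('public.demo_simple', 'The');  -- stop word: empty array {}
SELECT ts_lexize('public.demo_simple', 'YeS');  -- unknown word: NULL

-- Clean up when finished:
-- DROP TEXT SEARCH DICTIONARY public.demo_simple;
```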

Note

   The ts_lexize function expects a single token, not text. Here is a case
   where this can be confusing:
SELECT ts_lexize('thesaurus_astro', 'supernovae stars') is null;
 ?column?
----------
 t

   The thesaurus dictionary thesaurus_astro does know the phrase
   supernovae stars, but ts_lexize returns NULL because it does not parse
   the input text; it treats the whole string as a single token. Use
   plainto_tsquery or to_tsvector to test thesaurus dictionaries, for
   example:
SELECT plainto_tsquery('supernovae stars');
 plainto_tsquery
-----------------
 'sn'
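
   For dictionaries that work one word at a time, a possible workaround
   is to let ts_parse split the text and pass each token to ts_lexize
   separately. The sketch below does this with english_stem; it would
   not help with a thesaurus phrase, which has to be matched across
   several tokens.

```sql
-- Split the text with the default parser, skip blank tokens (tokid 12),
-- and run each remaining token through the dictionary on its own.
SELECT p.token, ts_lexize('english_stem', p.token) AS lexemes
FROM ts_parse('default', 'supernovae stars') AS p
WHERE p.tokid <> 12;
```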