12.8. Testing and Debugging Text Search #
12.8.1. Configuration Testing
12.8.2. Parser Testing
12.8.3. Dictionary Testing
The behavior of a custom text search configuration can easily become confusing. The functions described in this section are useful for testing text search objects. You can test a complete configuration, or test parsers and dictionaries separately.

12.8.1. Configuration Testing #
The function ts_debug allows easy testing of a text search configuration.

ts_debug([ config regconfig, ] document text,
         OUT alias text,
         OUT description text,
         OUT token text,
         OUT dictionaries regdictionary[],
         OUT dictionary regdictionary,
         OUT lexemes text[])
         returns setof record
ts_debug displays information about every token of document as produced by the parser and processed by the configured dictionaries. It uses the configuration specified by config, or default_text_search_config if that argument is omitted.
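If you are not sure which configuration will be picked up when config is omitted, you can inspect the setting and, if needed, change it for the current session (the value shown here is only an example):

SHOW default_text_search_config;
SET default_text_search_config = 'pg_catalog.english';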
ts_debug returns one row for each token identified in the text by the parser. The columns returned are:

  * alias text — short name of the token type
  * description text — description of the token type
  * token text — text of the token
  * dictionaries regdictionary[] — the dictionaries selected by the configuration for this token type
  * dictionary regdictionary — the dictionary that recognized the token, or NULL if none did
  * lexemes text[] — the lexeme(s) produced by the dictionary that recognized the token, or NULL if none did; an empty array ({}) means it was recognized as a stop word

Here is a simple example:
SELECT * FROM ts_debug('english', 'a fat cat sat on a mat - it ate a fat rats');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | cat   | {english_stem} | english_stem | {cat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | sat   | {english_stem} | english_stem | {sat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | on    | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | mat   | {english_stem} | english_stem | {mat}
 blank     | Space symbols   |       | {}             |              |
 blank     | Space symbols   | -     | {}             |              |
 asciiword | Word, all ASCII | it    | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | ate   | {english_stem} | english_stem | {ate}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | rats  | {english_stem} | english_stem | {rat}
For a more extensive demonstration, we first create a public.english configuration and Ispell dictionary for the English language:
CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );

CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);

ALTER TEXT SEARCH CONFIGURATION public.english
   ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;
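As a quick sanity check before running ts_debug, you can confirm that the asciiword mapping now lists both dictionaries. One way, sketched here against the pg_ts_config_map catalog (psql's \dF+ public.english meta-command shows the same information more conveniently):

-- List, in priority order, the dictionaries consulted for asciiword tokens
SELECT m.mapseqno, m.mapdict::regdictionary
FROM pg_ts_config_map AS m
WHERE m.mapcfg = 'public.english'::regconfig
  AND m.maptokentype = (SELECT tokid FROM ts_token_type('default')
                        WHERE alias = 'asciiword');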
SELECT * FROM ts_debug('public.english', 'The Brightest supernovaes');
   alias   |   description   |    token    |         dictionaries          |   dictionary   |   lexemes
-----------+-----------------+-------------+-------------------------------+----------------+-------------
 asciiword | Word, all ASCII | The         | {english_ispell,english_stem} | english_ispell | {}
 blank     | Space symbols   |             | {}                            |                |
 asciiword | Word, all ASCII | Brightest   | {english_ispell,english_stem} | english_ispell | {bright}
 blank     | Space symbols   |             | {}                            |                |
 asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem   | {supernova}
In this example, the word Brightest was recognized by the parser as an ASCII word (alias asciiword). For this token type the dictionary list is english_ispell and english_stem. The word was recognized by english_ispell, which reduced it to the noun bright. The word supernovaes is unknown to the english_ispell dictionary so it was passed to the next dictionary, and, fortunately, was recognized (in fact, english_stem is a Snowball dictionary which recognizes everything; that is why it was placed at the end of the dictionary list).

The word The was recognized by the english_ispell dictionary as a stop word (Section 12.6.1) and will not be indexed. The spaces are discarded too, since the configuration provides no dictionaries at all for them.
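A quick way to confirm the net effect described above is to run to_tsvector with this configuration; the expected result (shown here as a comment, assuming the configuration built above) is that The contributes nothing while the other two words are indexed under their lexemes:

SELECT to_tsvector('public.english', 'The Brightest supernovaes');
--        to_tsvector
-- --------------------------
--  'bright':2 'supernova':3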
You can reduce the width of the output by explicitly specifying which columns you want to see:

SELECT alias, token, dictionary, lexemes
FROM ts_debug('public.english', 'The Brightest supernovaes');
   alias   |    token    |   dictionary   |   lexemes
-----------+-------------+----------------+-------------
 asciiword | The         | english_ispell | {}
 asciiword | Brightest   | english_ispell | {bright}
 asciiword | supernovaes | english_stem   | {supernova}
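When debugging longer documents it is often more useful to look only at the problem cases. The following query, an illustrative pattern rather than anything built into ts_debug, lists just the tokens that will contribute nothing to the index: stop words, tokens rejected by every dictionary, and token types with no dictionaries at all:

SELECT token, alias, dictionaries, dictionary
FROM ts_debug('public.english', 'The Brightest supernovaes')
WHERE lexemes IS NULL OR lexemes = '{}';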
12.8.2. Parser Testing #

The following functions allow direct testing of a text search parser.

ts_parse(parser_name text, document text,
         OUT tokid integer, OUT token text) returns setof record
ts_parse(parser_oid oid, document text,
         OUT tokid integer, OUT token text) returns setof record
ts_parse parses the given document and returns a series of records, one for each token produced by parsing. Each record includes a tokid showing the assigned token type and a token which is the text of the token. For example:

SELECT * FROM ts_parse('default', '123 - a number');
 tokid | token
-------+--------
    22 | 123
    12 |
    12 | -
     1 | a
    12 |
     1 | number
ts_token_type(parser_name text, OUT tokid integer,
              OUT alias text, OUT description text) returns setof record
ts_token_type(parser_oid oid, OUT tokid integer,
              OUT alias text, OUT description text) returns setof record

ts_token_type returns a table which describes each type of token the specified parser can recognize. For each token type, the table gives the integer tokid that the parser uses to label a token of that type, the alias that names the token type in configuration commands, and a short description. For example:
SELECT * FROM ts_token_type('default');
 tokid |      alias      |               description
-------+-----------------+------------------------------------------
     1 | asciiword       | Word, all ASCII
     2 | word            | Word, all letters
     3 | numword         | Word, letters and digits
     4 | email           | Email address
     5 | url             | URL
     6 | host            | Host
     7 | sfloat          | Scientific notation
     8 | version         | Version number
     9 | hword_numpart   | Hyphenated word part, letters and digits
    10 | hword_part      | Hyphenated word part, all letters
    11 | hword_asciipart | Hyphenated word part, all ASCII
    12 | blank           | Space symbols
    13 | tag             | XML tag
    14 | protocol        | Protocol head
    15 | numhword        | Hyphenated word, letters and digits
    16 | asciihword      | Hyphenated word, all ASCII
    17 | hword           | Hyphenated word, all letters
    18 | url_path        | URL path
    19 | file            | File or path name
    20 | float           | Decimal notation
    21 | int             | Signed integer
    22 | uint            | Unsigned integer
    23 | entity          | XML entity
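Since ts_parse reports only the numeric tokid, it can be convenient to join its output against ts_token_type when reading the results. This query simply combines the two functions shown above:

-- Label each token from ts_parse with its alias and description
SELECT p.token, tt.alias, tt.description
FROM ts_parse('default', '123 - a number') AS p
     JOIN ts_token_type('default') AS tt ON tt.tokid = p.tokid;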
12.8.3. Dictionary Testing #

The ts_lexize function facilitates dictionary testing.

ts_lexize(dict regdictionary, token text) returns text[]

ts_lexize returns an array of lexemes if the input token is known to the dictionary, or an empty array if the token is known to the dictionary but it is a stop word, or NULL if it is an unknown word.
Examples:

SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}

SELECT ts_lexize('english_stem', 'a');
 ts_lexize
-----------
 {}
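Because the three possible results mean different things, it can be handy to test several words at once and label each outcome. This is only a sketch: it uses the english_ispell dictionary created earlier in this section and therefore assumes the Ispell dictionary files are installed. Note that stop_word comes out NULL, not false, for unknown words:

-- Distinguish stop words ({}), recognized words (lexemes), and unknown words (NULL)
SELECT w.word,
       ts_lexize('english_ispell', w.word)         AS lexemes,
       ts_lexize('english_ispell', w.word) IS NULL AS unknown,
       ts_lexize('english_ispell', w.word) = '{}'  AS stop_word
FROM (VALUES ('The'), ('Brightest'), ('qwerty')) AS w(word);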
The ts_lexize function expects a single token, not text. Here is a case where this can be confusing:

SELECT ts_lexize('thesaurus_astro', 'supernovae stars') is null;
 ?column?
----------
 t

The thesaurus dictionary thesaurus_astro does know the phrase supernovae stars, but ts_lexize fails since it does not parse the input text but treats it as a single token. Use plainto_tsquery or to_tsvector to test thesaurus dictionaries, for example:

SELECT plainto_tsquery('supernovae stars');
 plainto_tsquery
-----------------
 'sn'
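A to_tsvector call exercises the thesaurus through the full parsing pipeline. Assuming the thesaurus_astro setup from Section 12.6.4, where the phrase supernovae stars is substituted by sn, the expected result (shown here as a comment) would be:

SELECT to_tsvector('supernovae stars');
--  to_tsvector
-- -------------
--  'sn':1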