2 12.4. Additional Features #
4 12.4.1. Manipulating Documents
5 12.4.2. Manipulating Queries
6 12.4.3. Triggers for Automatic Updates
7 12.4.4. Gathering Document Statistics
9 This section describes additional functions and operators that are
10 useful in connection with text search.
12 12.4.1. Manipulating Documents #
14 Section 12.3.1 showed how raw textual documents can be converted into
15 tsvector values. PostgreSQL also provides functions and operators that
16 can be used to manipulate documents that are already in tsvector form.
19 The tsvector concatenation operator returns a vector which
20 combines the lexemes and positional information of the two
21 vectors given as arguments. Positions and weight labels are
22 retained during the concatenation. Positions appearing in the
23 right-hand vector are offset by the largest position mentioned
24 in the left-hand vector, so that the result is nearly equivalent
25 to the result of performing to_tsvector on the concatenation of
26 the two original document strings. (The equivalence is not
27 exact, because any stop-words removed from the end of the
28 left-hand argument will not affect the result, whereas they
29 would have affected the positions of the lexemes in the
30 right-hand argument if textual concatenation were used.)
32 One advantage of using concatenation in the vector form, rather
33 than concatenating text before applying to_tsvector, is that you
34 can use different configurations to parse different sections of
35 the document. Also, because the setweight function marks all
36 lexemes of the given vector the same way, it is necessary to
37 parse the text and do setweight before concatenating if you want
38 to label different parts of the document with different weights.
40 setweight(vector tsvector, weight "char") returns tsvector
41 setweight returns a copy of the input vector in which every
42 position has been labeled with the given weight, either A, B, C,
43 or D. (D is the default for new vectors and as such is not
44 displayed on output.) These labels are retained when vectors are
45 concatenated, allowing words from different parts of a document
46 to be weighted differently by ranking functions.
48 Note that weight labels apply to positions, not lexemes. If the
49 input vector has been stripped of positions then setweight does
52 length(vector tsvector) returns integer
53 Returns the number of lexemes stored in the vector.
55 strip(vector tsvector) returns tsvector
56 Returns a vector that lists the same lexemes as the given
57 vector, but lacks any position or weight information. The result
58 is usually much smaller than an unstripped vector, but it is
59 also less useful. Relevance ranking does not work as well on
60 stripped vectors as unstripped ones. Also, the <-> (FOLLOWED BY)
61 tsquery operator will never match stripped input, since it
62 cannot determine the distance between lexeme occurrences.
64 A full list of tsvector-related functions is available in Table 9.43.
66 12.4.2. Manipulating Queries #
68 Section 12.3.2 showed how raw textual queries can be converted into
69 tsquery values. PostgreSQL also provides functions and operators that
70 can be used to manipulate queries that are already in tsquery form.
73 Returns the AND-combination of the two given queries.
76 Returns the OR-combination of the two given queries.
79 Returns the negation (NOT) of the given query.
82 Returns a query that searches for a match to the first given
83 query immediately followed by a match to the second given query,
84 using the <-> (FOLLOWED BY) tsquery operator. For example:
86 SELECT to_tsquery('fat') <-> to_tsquery('cat | rat');
88 ----------------------------
89 'fat' <-> ( 'cat' | 'rat' )
91 tsquery_phrase(query1 tsquery, query2 tsquery [, distance integer ])
93 Returns a query that searches for a match to the first given
94 query followed by a match to the second given query at a
95 distance of exactly distance lexemes, using the <N> tsquery
96 operator. For example:
98 SELECT tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10);
103 numnode(query tsquery) returns integer
104 Returns the number of nodes (lexemes plus operators) in a
105 tsquery. This function is useful to determine if the query is
106 meaningful (returns > 0), or contains only stop words (returns
109 SELECT numnode(plainto_tsquery('the any'));
110 NOTICE: query contains only stopword(s) or doesn't contain lexeme(s), ignored
115 SELECT numnode('foo & bar'::tsquery);
120 querytree(query tsquery) returns text
121 Returns the portion of a tsquery that can be used for searching
122 an index. This function is useful for detecting unindexable
123 queries, for example those containing only stop words or only
124 negated terms. For example:
126 SELECT querytree(to_tsquery('defined'));
131 SELECT querytree(to_tsquery('!defined'));
136 12.4.2.1. Query Rewriting #
138 The ts_rewrite family of functions search a given tsquery for
139 occurrences of a target subquery, and replace each occurrence with a
140 substitute subquery. In essence this operation is a tsquery-specific
141 version of substring replacement. A target and substitute combination
142 can be thought of as a query rewrite rule. A collection of such rewrite
143 rules can be a powerful search aid. For example, you can expand the
144 search using synonyms (e.g., new york, big apple, nyc, gotham) or
145 narrow the search to direct the user to some hot topic. There is some
146 overlap in functionality between this feature and thesaurus
147 dictionaries (Section 12.6.4). However, you can modify a set of rewrite
148 rules on-the-fly without reindexing, whereas updating a thesaurus
149 requires reindexing to be effective.
151 ts_rewrite (query tsquery, target tsquery, substitute tsquery) returns
153 This form of ts_rewrite simply applies a single rewrite rule:
154 target is replaced by substitute wherever it appears in query.
157 SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);
162 ts_rewrite (query tsquery, select text) returns tsquery
163 This form of ts_rewrite accepts a starting query and an SQL
164 select command, which is given as a text string. The select must
165 yield two columns of tsquery type. For each row of the select
166 result, occurrences of the first column value (the target) are
167 replaced by the second column value (the substitute) within the
168 current query value. For example:
170 CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery);
171 INSERT INTO aliases VALUES('a', 'c');
173 SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases');
178 Note that when multiple rewrite rules are applied in this way,
179 the order of application can be important; so in practice you
180 will want the source query to ORDER BY some ordering key.
182 Let's consider a real-life astronomical example. We'll expand query
183 supernovae using table-driven rewriting rules:
184 CREATE TABLE aliases (t tsquery primary key, s tsquery);
185 INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn')
188 SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');
190 ---------------------------------
191 'crab' & ( 'supernova' | 'sn' )
193 We can change the rewriting rules just by updating the table:
195 SET s = to_tsquery('supernovae|sn & !nebulae')
196 WHERE t = to_tsquery('supernovae');
198 SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');
200 ---------------------------------------------
201 'crab' & ( 'supernova' | 'sn' & !'nebula' )
203 Rewriting can be slow when there are many rewriting rules, since it
204 checks every rule for a possible match. To filter out obvious
205 non-candidate rules we can use the containment operators for the
206 tsquery type. In the example below, we select only those rules which
207 might match the original query:
208 SELECT ts_rewrite('a & b'::tsquery,
209 'SELECT t,s FROM aliases WHERE ''a & b''::tsquery @> t');
214 12.4.3. Triggers for Automatic Updates #
218 The method described in this section has been obsoleted by the use of
219 stored generated columns, as described in Section 12.2.2.
221 When using a separate column to store the tsvector representation of
222 your documents, it is necessary to create a trigger to update the
223 tsvector column when the document content columns change. Two built-in
224 trigger functions are available for this, or you can write your own.
225 tsvector_update_trigger(tsvector_column_name, config_name, text_column_name [,
227 tsvector_update_trigger_column(tsvector_column_name, config_column_name, text_c
230 These trigger functions automatically compute a tsvector column from
231 one or more textual columns, under the control of parameters specified
232 in the CREATE TRIGGER command. An example of their use is:
233 CREATE TABLE messages (
239 CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
240 ON messages FOR EACH ROW EXECUTE FUNCTION
241 tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);
243 INSERT INTO messages VALUES('title here', 'the body text is here');
245 SELECT * FROM messages;
247 ------------+-----------------------+----------------------------
248 title here | the body text is here | 'bodi':4 'text':5 'titl':1
250 SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title & body');
252 ------------+-----------------------
253 title here | the body text is here
255 Having created this trigger, any change in title or body will
256 automatically be reflected into tsv, without the application having to
259 The first trigger argument must be the name of the tsvector column to
260 be updated. The second argument specifies the text search configuration
261 to be used to perform the conversion. For tsvector_update_trigger, the
262 configuration name is simply given as the second trigger argument. It
263 must be schema-qualified as shown above, so that the trigger behavior
264 will not change with changes in search_path. For
265 tsvector_update_trigger_column, the second trigger argument is the name
266 of another table column, which must be of type regconfig. This allows a
267 per-row selection of configuration to be made. The remaining
268 argument(s) are the names of textual columns (of type text, varchar, or
269 char). These will be included in the document in the order given. NULL
270 values will be skipped (but the other columns will still be indexed).
272 A limitation of these built-in triggers is that they treat all the
273 input columns alike. To process columns differently — for example, to
274 weight title differently from body — it is necessary to write a custom
275 trigger. Here is an example using PL/pgSQL as the trigger language:
276 CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
279 setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') |
281 setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');
286 CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
287 ON messages FOR EACH ROW EXECUTE FUNCTION messages_trigger();
289 Keep in mind that it is important to specify the configuration name
290 explicitly when creating tsvector values inside triggers, so that the
291 column's contents will not be affected by changes to
292 default_text_search_config. Failure to do this is likely to lead to
293 problems such as search results changing after a dump and restore.
295 12.4.4. Gathering Document Statistics #
297 The function ts_stat is useful for checking your configuration and for
298 finding stop-word candidates.
299 ts_stat(sqlquery text, [ weights text, ]
300 OUT word text, OUT ndoc integer,
301 OUT nentry integer) returns setof record
303 sqlquery is a text value containing an SQL query which must return a
304 single tsvector column. ts_stat executes the query and returns
305 statistics about each distinct lexeme (word) contained in the tsvector
306 data. The columns returned are
307 * word text — the value of a lexeme
308 * ndoc integer — number of documents (tsvectors) the word occurred in
309 * nentry integer — total number of occurrences of the word
311 If weights is supplied, only occurrences having one of those weights
314 For example, to find the ten most frequent words in a document
316 SELECT * FROM ts_stat('SELECT vector FROM apod')
317 ORDER BY nentry DESC, ndoc DESC, word
320 The same, but counting only word occurrences with weight A or B:
321 SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
322 ORDER BY nentry DESC, ndoc DESC, word