begriffs open source - ai-pg/blob - full-docs/txt/textsearch-features.txt

   1
   2 12.4. Additional Features #
   3
   4    12.4.1. Manipulating Documents
   5    12.4.2. Manipulating Queries
   6    12.4.3. Triggers for Automatic Updates
   7    12.4.4. Gathering Document Statistics
   8
   9    This section describes additional functions and operators that are
  10    useful in connection with text search.
  11
  12 12.4.1. Manipulating Documents #
  13
  14    Section 12.3.1 showed how raw textual documents can be converted into
  15    tsvector values. PostgreSQL also provides functions and operators that
  16    can be used to manipulate documents that are already in tsvector form.
  17
  18    tsvector || tsvector
  19           The tsvector concatenation operator returns a vector which
  20           combines the lexemes and positional information of the two
  21           vectors given as arguments. Positions and weight labels are
  22           retained during the concatenation. Positions appearing in the
  23           right-hand vector are offset by the largest position mentioned
  24           in the left-hand vector, so that the result is nearly equivalent
  25           to the result of performing to_tsvector on the concatenation of
  26           the two original document strings. (The equivalence is not
  27           exact, because any stop-words removed from the end of the
  28           left-hand argument will not affect the result, whereas they
  29           would have affected the positions of the lexemes in the
  30           right-hand argument if textual concatenation were used.)
  31
  32           One advantage of using concatenation in the vector form, rather
  33           than concatenating text before applying to_tsvector, is that you
  34           can use different configurations to parse different sections of
  35           the document. Also, because the setweight function marks all
  36           lexemes of the given vector the same way, it is necessary to
  37           parse the text and do setweight before concatenating if you want
  38           to label different parts of the document with different weights.
  39
  40    setweight(vector tsvector, weight "char") returns tsvector
  41           setweight returns a copy of the input vector in which every
  42           position has been labeled with the given weight, either A, B, C,
  43           or D. (D is the default for new vectors and as such is not
  44           displayed on output.) These labels are retained when vectors are
  45           concatenated, allowing words from different parts of a document
  46           to be weighted differently by ranking functions.
  47
  48           Note that weight labels apply to positions, not lexemes. If the
  49           input vector has been stripped of positions then setweight does
  50           nothing.
  51
  52    length(vector tsvector) returns integer
  53           Returns the number of lexemes stored in the vector.
  54
  55    strip(vector tsvector) returns tsvector
  56           Returns a vector that lists the same lexemes as the given
  57           vector, but lacks any position or weight information. The result
  58           is usually much smaller than an unstripped vector, but it is
  59           also less useful. Relevance ranking does not work as well on
  60           stripped vectors as unstripped ones. Also, the <-> (FOLLOWED BY)
  61           tsquery operator will never match stripped input, since it
  62           cannot determine the distance between lexeme occurrences.
  63
  64    A full list of tsvector-related functions is available in Table 9.43.
  65
  66 12.4.2. Manipulating Queries #
  67
  68    Section 12.3.2 showed how raw textual queries can be converted into
  69    tsquery values. PostgreSQL also provides functions and operators that
  70    can be used to manipulate queries that are already in tsquery form.
  71
  72    tsquery && tsquery
  73           Returns the AND-combination of the two given queries.
  74
  75    tsquery || tsquery
  76           Returns the OR-combination of the two given queries.
  77
  78    !! tsquery
  79           Returns the negation (NOT) of the given query.
  80
  81    tsquery <-> tsquery
  82           Returns a query that searches for a match to the first given
  83           query immediately followed by a match to the second given query,
  84           using the <-> (FOLLOWED BY) tsquery operator. For example:
  85
  86 SELECT to_tsquery('fat') <-> to_tsquery('cat | rat');
  87           ?column?
  88 ----------------------------
  89  'fat' <-> ( 'cat' | 'rat' )
  90
  91    tsquery_phrase(query1 tsquery, query2 tsquery [, distance integer ])
  92           returns tsquery
  93           Returns a query that searches for a match to the first given
  94           query followed by a match to the second given query at a
  95           distance of exactly distance lexemes, using the <N> tsquery
  96           operator. For example:
  97
  98 SELECT tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10);
  99   tsquery_phrase
 100 ------------------
 101  'fat' <10> 'cat'
 102
 103    numnode(query tsquery) returns integer
 104           Returns the number of nodes (lexemes plus operators) in a
 105           tsquery. This function is useful to determine if the query is
 106           meaningful (returns > 0), or contains only stop words (returns
 107           0). Examples:
 108
 109 SELECT numnode(plainto_tsquery('the any'));
 110 NOTICE:  query contains only stopword(s) or doesn't contain lexeme(s), ignored
 111  numnode
 112 ---------
 113        0
 114
 115 SELECT numnode('foo & bar'::tsquery);
 116  numnode
 117 ---------
 118        3
 119
 120    querytree(query tsquery) returns text
 121           Returns the portion of a tsquery that can be used for searching
 122           an index. This function is useful for detecting unindexable
 123           queries, for example those containing only stop words or only
 124           negated terms. For example:
 125
 126 SELECT querytree(to_tsquery('defined'));
 127  querytree
 128 -----------
 129  'defin'
 130
 131 SELECT querytree(to_tsquery('!defined'));
 132  querytree
 133 -----------
 134  T
 135
 136 12.4.2.1. Query Rewriting #
 137
 138    The ts_rewrite family of functions search a given tsquery for
 139    occurrences of a target subquery, and replace each occurrence with a
 140    substitute subquery. In essence this operation is a tsquery-specific
 141    version of substring replacement. A target and substitute combination
 142    can be thought of as a query rewrite rule. A collection of such rewrite
 143    rules can be a powerful search aid. For example, you can expand the
 144    search using synonyms (e.g., new york, big apple, nyc, gotham) or
 145    narrow the search to direct the user to some hot topic. There is some
 146    overlap in functionality between this feature and thesaurus
 147    dictionaries (Section 12.6.4). However, you can modify a set of rewrite
 148    rules on-the-fly without reindexing, whereas updating a thesaurus
 149    requires reindexing to be effective.
 150
 151    ts_rewrite (query tsquery, target tsquery, substitute tsquery) returns
 152           tsquery
 153           This form of ts_rewrite simply applies a single rewrite rule:
 154           target is replaced by substitute wherever it appears in query.
 155           For example:
 156
 157 SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);
 158  ts_rewrite
 159 ------------
 160  'b' & 'c'
 161
 162    ts_rewrite (query tsquery, select text) returns tsquery
 163           This form of ts_rewrite accepts a starting query and an SQL
 164           select command, which is given as a text string. The select must
 165           yield two columns of tsquery type. For each row of the select
 166           result, occurrences of the first column value (the target) are
 167           replaced by the second column value (the substitute) within the
 168           current query value. For example:
 169
 170 CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery);
 171 INSERT INTO aliases VALUES('a', 'c');
 172
 173 SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases');
 174  ts_rewrite
 175 ------------
 176  'b' & 'c'
 177
 178           Note that when multiple rewrite rules are applied in this way,
 179           the order of application can be important; so in practice you
 180           will want the source query to ORDER BY some ordering key.
 181
 182    Let's consider a real-life astronomical example. We'll expand query
 183    supernovae using table-driven rewriting rules:
 184 CREATE TABLE aliases (t tsquery primary key, s tsquery);
 185 INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn')
 186 );
 187
 188 SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');
 189            ts_rewrite
 190 ---------------------------------
 191  'crab' & ( 'supernova' | 'sn' )
 192
 193    We can change the rewriting rules just by updating the table:
 194 UPDATE aliases
 195 SET s = to_tsquery('supernovae|sn & !nebulae')
 196 WHERE t = to_tsquery('supernovae');
 197
 198 SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');
 199                  ts_rewrite
 200 ---------------------------------------------
 201  'crab' & ( 'supernova' | 'sn' & !'nebula' )
 202
 203    Rewriting can be slow when there are many rewriting rules, since it
 204    checks every rule for a possible match. To filter out obvious
 205    non-candidate rules we can use the containment operators for the
 206    tsquery type. In the example below, we select only those rules which
 207    might match the original query:
 208 SELECT ts_rewrite('a & b'::tsquery,
 209                   'SELECT t,s FROM aliases WHERE ''a & b''::tsquery @> t');
 210  ts_rewrite
 211 ------------
 212  'b' & 'c'
 213
 214 12.4.3. Triggers for Automatic Updates #
 215
 216 Note
 217
 218    The method described in this section has been obsoleted by the use of
 219    stored generated columns, as described in Section 12.2.2.
 220
 221    When using a separate column to store the tsvector representation of
 222    your documents, it is necessary to create a trigger to update the
 223    tsvector column when the document content columns change. Two built-in
 224    trigger functions are available for this, or you can write your own.
 225 tsvector_update_trigger(tsvector_column_name, config_name, text_column_name [,
 226 ... ])
 227 tsvector_update_trigger_column(tsvector_column_name, config_column_name, text_c
 228 olumn_name [, ... ])
 229
 230    These trigger functions automatically compute a tsvector column from
 231    one or more textual columns, under the control of parameters specified
 232    in the CREATE TRIGGER command. An example of their use is:
 233 CREATE TABLE messages (
 234     title       text,
 235     body        text,
 236     tsv         tsvector
 237 );
 238
 239 CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
 240 ON messages FOR EACH ROW EXECUTE FUNCTION
 241 tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);
 242
 243 INSERT INTO messages VALUES('title here', 'the body text is here');
 244
 245 SELECT * FROM messages;
 246    title    |         body          |            tsv
 247 ------------+-----------------------+----------------------------
 248  title here | the body text is here | 'bodi':4 'text':5 'titl':1
 249
 250 SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title & body');
 251    title    |         body
 252 ------------+-----------------------
 253  title here | the body text is here
 254
 255    Having created this trigger, any change in title or body will
 256    automatically be reflected into tsv, without the application having to
 257    worry about it.
 258
 259    The first trigger argument must be the name of the tsvector column to
 260    be updated. The second argument specifies the text search configuration
 261    to be used to perform the conversion. For tsvector_update_trigger, the
 262    configuration name is simply given as the second trigger argument. It
 263    must be schema-qualified as shown above, so that the trigger behavior
 264    will not change with changes in search_path. For
 265    tsvector_update_trigger_column, the second trigger argument is the name
 266    of another table column, which must be of type regconfig. This allows a
 267    per-row selection of configuration to be made. The remaining
 268    argument(s) are the names of textual columns (of type text, varchar, or
 269    char). These will be included in the document in the order given. NULL
 270    values will be skipped (but the other columns will still be indexed).
 271
 272    A limitation of these built-in triggers is that they treat all the
 273    input columns alike. To process columns differently — for example, to
 274    weight title differently from body — it is necessary to write a custom
 275    trigger. Here is an example using PL/pgSQL as the trigger language:
 276 CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
 277 begin
 278   new.tsv :=
 279      setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') |
 280 |
 281      setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');
 282   return new;
 283 end
 284 $$ LANGUAGE plpgsql;
 285
 286 CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
 287     ON messages FOR EACH ROW EXECUTE FUNCTION messages_trigger();
 288
 289    Keep in mind that it is important to specify the configuration name
 290    explicitly when creating tsvector values inside triggers, so that the
 291    column's contents will not be affected by changes to
 292    default_text_search_config. Failure to do this is likely to lead to
 293    problems such as search results changing after a dump and restore.
 294
 295 12.4.4. Gathering Document Statistics #
 296
 297    The function ts_stat is useful for checking your configuration and for
 298    finding stop-word candidates.
 299 ts_stat(sqlquery text, [ weights text, ]
 300         OUT word text, OUT ndoc integer,
 301         OUT nentry integer) returns setof record
 302
 303    sqlquery is a text value containing an SQL query which must return a
 304    single tsvector column. ts_stat executes the query and returns
 305    statistics about each distinct lexeme (word) contained in the tsvector
 306    data. The columns returned are
 307      * word text — the value of a lexeme
 308      * ndoc integer — number of documents (tsvectors) the word occurred in
 309      * nentry integer — total number of occurrences of the word
 310
 311    If weights is supplied, only occurrences having one of those weights
 312    are counted.
 313
 314    For example, to find the ten most frequent words in a document
 315    collection:
 316 SELECT * FROM ts_stat('SELECT vector FROM apod')
 317 ORDER BY nentry DESC, ndoc DESC, word
 318 LIMIT 10;
 319
 320    The same, but counting only word occurrences with weight A or B:
 321 SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
 322 ORDER BY nentry DESC, ndoc DESC, word
 323 LIMIT 10;