begriffs open source - ai-pg/blob - full-docs/txt/textsearch-controls.txt

   1
   2 12.3. Controlling Text Search #
   3
   4    12.3.1. Parsing Documents
   5    12.3.2. Parsing Queries
   6    12.3.3. Ranking Search Results
   7    12.3.4. Highlighting Results
   8
   9    To implement full text searching there must be a function to create a
  10    tsvector from a document and a tsquery from a user query. Also, we need
  11    to return results in a useful order, so we need a function that
  12    compares documents with respect to their relevance to the query. It's
  13    also important to be able to display the results nicely. PostgreSQL
  14    provides support for all of these functions.
  15
  16 12.3.1. Parsing Documents #
  17
  18    PostgreSQL provides the function to_tsvector for converting a document
  19    to the tsvector data type.
  20 to_tsvector([ config regconfig, ] document text) returns tsvector
  21
  22    to_tsvector parses a textual document into tokens, reduces the tokens
  23    to lexemes, and returns a tsvector which lists the lexemes together
  24    with their positions in the document. The document is processed
  25    according to the specified or default text search configuration. Here
  26    is a simple example:
  27 SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
  28                   to_tsvector
  29 -----------------------------------------------------
  30  'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
  31
  32    In the example above we see that the resulting tsvector does not
  33    contain the words a, on, or it, the word rats became rat, and the
  34    punctuation sign - was ignored.
  35
  36    The to_tsvector function internally calls a parser which breaks the
  37    document text into tokens and assigns a type to each token. For each
  38    token, a list of dictionaries (Section 12.6) is consulted, where the
  39    list can vary depending on the token type. The first dictionary that
  40    recognizes the token emits one or more normalized lexemes to represent
  41    the token. For example, rats became rat because one of the dictionaries
  42    recognized that the word rats is a plural form of rat. Some words are
  43    recognized as stop words (Section 12.6.1), which causes them to be
  44    ignored since they occur too frequently to be useful in searching. In
  45    our example these are a, on, and it. If no dictionary in the list
  46    recognizes the token then it is also ignored. In this example that
  47    happened to the punctuation sign - because there are in fact no
  48    dictionaries assigned for its token type (Space symbols), meaning space
  49    tokens will never be indexed. The choices of parser, dictionaries and
  50    which types of tokens to index are determined by the selected text
  51    search configuration (Section 12.7). It is possible to have many
  52    different configurations in the same database, and predefined
  53    configurations are available for various languages. In our example we
  54    used the default configuration english for the English language.
  55
  56    The function setweight can be used to label the entries of a tsvector
  57    with a given weight, where a weight is one of the letters A, B, C, or
  58    D. This is typically used to mark entries coming from different parts
  59    of a document, such as title versus body. Later, this information can
  60    be used for ranking of search results.
  61
  62    Because to_tsvector(NULL) will return NULL, it is recommended to use
  63    coalesce whenever a field might be null. Here is the recommended method
  64    for creating a tsvector from a structured document:
  65 UPDATE tt SET ti =
  66     setweight(to_tsvector(coalesce(title,'')), 'A')    ||
  67     setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
  68     setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
  69     setweight(to_tsvector(coalesce(body,'')), 'D');
  70
  71    Here we have used setweight to label the source of each lexeme in the
  72    finished tsvector, and then merged the labeled tsvector values using
  73    the tsvector concatenation operator ||. (Section 12.4.1 gives details
  74    about these operations.)
  75
  76 12.3.2. Parsing Queries #
  77
  78    PostgreSQL provides the functions to_tsquery, plainto_tsquery,
  79    phraseto_tsquery and websearch_to_tsquery for converting a query to the
  80    tsquery data type. to_tsquery offers access to more features than
  81    either plainto_tsquery or phraseto_tsquery, but it is less forgiving
  82    about its input. websearch_to_tsquery is a simplified version of
  83    to_tsquery with an alternative syntax, similar to the one used by web
  84    search engines.
  85 to_tsquery([ config regconfig, ] querytext text) returns tsquery
  86
  87    to_tsquery creates a tsquery value from querytext, which must consist
  88    of single tokens separated by the tsquery operators & (AND), | (OR), !
  89    (NOT), and <-> (FOLLOWED BY), possibly grouped using parentheses. In
  90    other words, the input to to_tsquery must already follow the general
  91    rules for tsquery input, as described in Section 8.11.2. The difference
  92    is that while basic tsquery input takes the tokens at face value,
  93    to_tsquery normalizes each token into a lexeme using the specified or
  94    default configuration, and discards any tokens that are stop words
  95    according to the configuration. For example:
  96 SELECT to_tsquery('english', 'The & Fat & Rats');
  97   to_tsquery
  98 ---------------
  99  'fat' & 'rat'
 100
 101    As in basic tsquery input, weight(s) can be attached to each lexeme to
 102    restrict it to match only tsvector lexemes of those weight(s). For
 103    example:
 104 SELECT to_tsquery('english', 'Fat | Rats:AB');
 105     to_tsquery
 106 ------------------
 107  'fat' | 'rat':AB
 108
 109    Also, * can be attached to a lexeme to specify prefix matching:
 110 SELECT to_tsquery('supern:*A & star:A*B');
 111         to_tsquery
 112 --------------------------
 113  'supern':*A & 'star':*AB
 114
 115    Such a lexeme will match any word in a tsvector that begins with the
 116    given string.
 117
 118    to_tsquery can also accept single-quoted phrases. This is primarily
 119    useful when the configuration includes a thesaurus dictionary that may
 120    trigger on such phrases. In the example below, a thesaurus contains the
 121    rule supernovae stars : sn:
 122 SELECT to_tsquery('''supernovae stars'' & !crab');
 123   to_tsquery
 124 ---------------
 125  'sn' & !'crab'
 126
 127    Without quotes, to_tsquery will generate a syntax error for tokens that
 128    are not separated by an AND, OR, or FOLLOWED BY operator.
 129 plainto_tsquery([ config regconfig, ] querytext text) returns tsquery
 130
 131    plainto_tsquery transforms the unformatted text querytext to a tsquery
 132    value. The text is parsed and normalized much as for to_tsvector, then
 133    the & (AND) tsquery operator is inserted between surviving words.
 134
 135    Example:
 136 SELECT plainto_tsquery('english', 'The Fat Rats');
 137  plainto_tsquery
 138 -----------------
 139  'fat' & 'rat'
 140
 141    Note that plainto_tsquery will not recognize tsquery operators, weight
 142    labels, or prefix-match labels in its input:
 143 SELECT plainto_tsquery('english', 'The Fat & Rats:C');
 144    plainto_tsquery
 145 ---------------------
 146  'fat' & 'rat' & 'c'
 147
 148    Here, all the input punctuation was discarded.
 149 phraseto_tsquery([ config regconfig, ] querytext text) returns tsquery
 150
 151    phraseto_tsquery behaves much like plainto_tsquery, except that it
 152    inserts the <-> (FOLLOWED BY) operator between surviving words instead
 153    of the & (AND) operator. Also, stop words are not simply discarded, but
 154    are accounted for by inserting <N> operators rather than <-> operators.
 155    This function is useful when searching for exact lexeme sequences,
 156    since the FOLLOWED BY operators check lexeme order not just the
 157    presence of all the lexemes.
 158
 159    Example:
 160 SELECT phraseto_tsquery('english', 'The Fat Rats');
 161  phraseto_tsquery
 162 ------------------
 163  'fat' <-> 'rat'
 164
 165    Like plainto_tsquery, the phraseto_tsquery function will not recognize
 166    tsquery operators, weight labels, or prefix-match labels in its input:
 167 SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
 168       phraseto_tsquery
 169 -----------------------------
 170  'fat' <-> 'rat' <-> 'c'
 171
 172 websearch_to_tsquery([ config regconfig, ] querytext text) returns tsquery
 173
 174    websearch_to_tsquery creates a tsquery value from querytext using an
 175    alternative syntax in which simple unformatted text is a valid query.
 176    Unlike plainto_tsquery and phraseto_tsquery, it also recognizes certain
 177    operators. Moreover, this function will never raise syntax errors,
 178    which makes it possible to use raw user-supplied input for search. The
 179    following syntax is supported:
 180      * unquoted text: text not inside quote marks will be converted to
 181        terms separated by & operators, as if processed by plainto_tsquery.
 182      * "quoted text": text inside quote marks will be converted to terms
 183        separated by <-> operators, as if processed by phraseto_tsquery.
 184      * OR: the word “or” will be converted to the | operator.
 185      * -: a dash will be converted to the ! operator.
 186
 187    Other punctuation is ignored. So like plainto_tsquery and
 188    phraseto_tsquery, the websearch_to_tsquery function will not recognize
 189    tsquery operators, weight labels, or prefix-match labels in its input.
 190
 191    Examples:
 192 SELECT websearch_to_tsquery('english', 'The fat rats');
 193  websearch_to_tsquery
 194 ----------------------
 195  'fat' & 'rat'
 196 (1 row)
 197
 198 SELECT websearch_to_tsquery('english', '"supernovae stars" -crab');
 199        websearch_to_tsquery
 200 ----------------------------------
 201  'supernova' <-> 'star' & !'crab'
 202 (1 row)
 203
 204 SELECT websearch_to_tsquery('english', '"sad cat" or "fat rat"');
 205        websearch_to_tsquery
 206 -----------------------------------
 207  'sad' <-> 'cat' | 'fat' <-> 'rat'
 208 (1 row)
 209
 210 SELECT websearch_to_tsquery('english', 'signal -"segmentation fault"');
 211          websearch_to_tsquery
 212 ---------------------------------------
 213  'signal' & !( 'segment' <-> 'fault' )
 214 (1 row)
 215
 216 SELECT websearch_to_tsquery('english', '""" )( dummy \\ query <->');
 217  websearch_to_tsquery
 218 ----------------------
 219  'dummi' & 'queri'
 220 (1 row)
 221
 222 12.3.3. Ranking Search Results #
 223
 224    Ranking attempts to measure how relevant documents are to a particular
 225    query, so that when there are many matches the most relevant ones can
 226    be shown first. PostgreSQL provides two predefined ranking functions,
 227    which take into account lexical, proximity, and structural information;
 228    that is, they consider how often the query terms appear in the
 229    document, how close together the terms are in the document, and how
 230    important is the part of the document where they occur. However, the
 231    concept of relevancy is vague and very application-specific. Different
 232    applications might require additional information for ranking, e.g.,
 233    document modification time. The built-in ranking functions are only
 234    examples. You can write your own ranking functions and/or combine their
 235    results with additional factors to fit your specific needs.
 236
 237    The two ranking functions currently available are:
 238
 239    ts_rank([ weights float4[], ] vector tsvector, query tsquery [,
 240           normalization integer ]) returns float4
 241           Ranks vectors based on the frequency of their matching lexemes.
 242
 243    ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [,
 244           normalization integer ]) returns float4
 245           This function computes the cover density ranking for the given
 246           document vector and query, as described in Clarke, Cormack, and
 247           Tudhope's "Relevance Ranking for One to Three Term Queries" in
 248           the journal "Information Processing and Management", 1999. Cover
 249           density is similar to ts_rank ranking except that the proximity
 250           of matching lexemes to each other is taken into consideration.
 251
 252           This function requires lexeme positional information to perform
 253           its calculation. Therefore, it ignores any “stripped” lexemes in
 254           the tsvector. If there are no unstripped lexemes in the input,
 255           the result will be zero. (See Section 12.4.1 for more
 256           information about the strip function and positional information
 257           in tsvectors.)
 258
 259    For both these functions, the optional weights argument offers the
 260    ability to weigh word instances more or less heavily depending on how
 261    they are labeled. The weight arrays specify how heavily to weigh each
 262    category of word, in the order:
 263 {D-weight, C-weight, B-weight, A-weight}
 264
 265    If no weights are provided, then these defaults are used:
 266 {0.1, 0.2, 0.4, 1.0}
 267
 268    Typically weights are used to mark words from special areas of the
 269    document, like the title or an initial abstract, so they can be treated
 270    with more or less importance than words in the document body.
 271
 272    Since a longer document has a greater chance of containing a query term
 273    it is reasonable to take into account document size, e.g., a
 274    hundred-word document with five instances of a search word is probably
 275    more relevant than a thousand-word document with five instances. Both
 276    ranking functions take an integer normalization option that specifies
 277    whether and how a document's length should impact its rank. The integer
 278    option controls several behaviors, so it is a bit mask: you can specify
 279    one or more behaviors using | (for example, 2|4).
 280      * 0 (the default) ignores the document length
 281      * 1 divides the rank by 1 + the logarithm of the document length
 282      * 2 divides the rank by the document length
 283      * 4 divides the rank by the mean harmonic distance between extents
 284        (this is implemented only by ts_rank_cd)
 285      * 8 divides the rank by the number of unique words in document
 286      * 16 divides the rank by 1 + the logarithm of the number of unique
 287        words in document
 288      * 32 divides the rank by itself + 1
 289
 290    If more than one flag bit is specified, the transformations are applied
 291    in the order listed.
 292
 293    It is important to note that the ranking functions do not use any
 294    global information, so it is impossible to produce a fair normalization
 295    to 1% or 100% as sometimes desired. Normalization option 32
 296    (rank/(rank+1)) can be applied to scale all ranks into the range zero
 297    to one, but of course this is just a cosmetic change; it will not
 298    affect the ordering of the search results.
 299
 300    Here is an example that selects only the ten highest-ranked matches:
 301 SELECT title, ts_rank_cd(textsearch, query) AS rank
 302 FROM apod, to_tsquery('neutrino|(dark & matter)') query
 303 WHERE query @@ textsearch
 304 ORDER BY rank DESC
 305 LIMIT 10;
 306                      title                     |   rank
 307 -----------------------------------------------+----------
 308  Neutrinos in the Sun                          |      3.1
 309  The Sudbury Neutrino Detector                 |      2.4
 310  A MACHO View of Galactic Dark Matter          |  2.01317
 311  Hot Gas and Dark Matter                       |  1.91171
 312  The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 313  Rafting for Solar Neutrinos                   |      1.9
 314  NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 315  Hot Gas and Dark Matter                       |   1.6123
 316  Ice Fishing for Cosmic Neutrinos              |      1.6
 317  Weak Lensing Distorts the Universe            | 0.818218
 318
 319    This is the same example using normalized ranking:
 320 SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
 321 FROM apod, to_tsquery('neutrino|(dark & matter)') query
 322 WHERE  query @@ textsearch
 323 ORDER BY rank DESC
 324 LIMIT 10;
 325                      title                     |        rank
 326 -----------------------------------------------+-------------------
 327  Neutrinos in the Sun                          | 0.756097569485493
 328  The Sudbury Neutrino Detector                 | 0.705882361190954
 329  A MACHO View of Galactic Dark Matter          | 0.668123210574724
 330  Hot Gas and Dark Matter                       |  0.65655958650282
 331  The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 332  Rafting for Solar Neutrinos                   | 0.655172410958162
 333  NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 334  Hot Gas and Dark Matter                       | 0.617195790024749
 335  Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 336  Weak Lensing Distorts the Universe            | 0.450010798361481
 337
 338    Ranking can be expensive since it requires consulting the tsvector of
 339    each matching document, which can be I/O bound and therefore slow.
 340    Unfortunately, it is almost impossible to avoid since practical queries
 341    often result in large numbers of matches.
 342
 343 12.3.4. Highlighting Results #
 344
 345    To present search results it is ideal to show a part of each document
 346    and how it is related to the query. Usually, search engines show
 347    fragments of the document with marked search terms. PostgreSQL provides
 348    a function ts_headline that implements this functionality.
 349 ts_headline([ config regconfig, ] document text, query tsquery [, options text ]
 350 ) returns text
 351
 352    ts_headline accepts a document along with a query, and returns an
 353    excerpt from the document in which terms from the query are
 354    highlighted. Specifically, the function will use the query to select
 355    relevant text fragments, and then highlight all words that appear in
 356    the query, even if those word positions do not match the query's
 357    restrictions. The configuration to be used to parse the document can be
 358    specified by config; if config is omitted, the
 359    default_text_search_config configuration is used.
 360
 361    If an options string is specified it must consist of a comma-separated
 362    list of one or more option=value pairs. The available options are:
 363      * MaxWords, MinWords (integers): these numbers determine the longest
 364        and shortest headlines to output. The default values are 35 and 15.
 365      * ShortWord (integer): words of this length or less will be dropped
 366        at the start and end of a headline, unless they are query terms.
 367        The default value of three eliminates common English articles.
 368      * HighlightAll (boolean): if true the whole document will be used as
 369        the headline, ignoring the preceding three parameters. The default
 370        is false.
 371      * MaxFragments (integer): maximum number of text fragments to
 372        display. The default value of zero selects a non-fragment-based
 373        headline generation method. A value greater than zero selects
 374        fragment-based headline generation (see below).
 375      * StartSel, StopSel (strings): the strings with which to delimit
 376        query words appearing in the document, to distinguish them from
 377        other excerpted words. The default values are “<b>” and “</b>”,
 378        which can be suitable for HTML output (but see the warning below).
 379      * FragmentDelimiter (string): When more than one fragment is
 380        displayed, the fragments will be separated by this string. The
 381        default is “ ... ”.
 382
 383 Warning: Cross-site Scripting (XSS) Safety
 384
 385    The output from ts_headline is not guaranteed to be safe for direct
 386    inclusion in web pages. When HighlightAll is false (the default), some
 387    simple XML tags are removed from the document, but this is not
 388    guaranteed to remove all HTML markup. Therefore, this does not provide
 389    an effective defense against attacks such as cross-site scripting (XSS)
 390    attacks, when working with untrusted input. To guard against such
 391    attacks, all HTML markup should be removed from the input document, or
 392    an HTML sanitizer should be used on the output.
 393
 394    These option names are recognized case-insensitively. You must
 395    double-quote string values if they contain spaces or commas.
 396
 397    In non-fragment-based headline generation, ts_headline locates matches
 398    for the given query and chooses a single one to display, preferring
 399    matches that have more query words within the allowed headline length.
 400    In fragment-based headline generation, ts_headline locates the query
 401    matches and splits each match into “fragments” of no more than MaxWords
 402    words each, preferring fragments with more query words, and when
 403    possible “stretching” fragments to include surrounding words. The
 404    fragment-based mode is thus more useful when the query matches span
 405    large sections of the document, or when it's desirable to display
 406    multiple matches. In either mode, if no query matches can be
 407    identified, then a single fragment of the first MinWords words in the
 408    document will be displayed.
 409
 410    For example:
 411 SELECT ts_headline('english',
 412   'The most common type of search
 413 is to find all documents containing given query terms
 414 and return them in order of their similarity to the
 415 query.',
 416   to_tsquery('english', 'query & similarity'));
 417                         ts_headline
 418 ------------------------------------------------------------
 419  containing given <b>query</b> terms                       +
 420  and return them in order of their <b>similarity</b> to the+
 421  <b>query</b>.
 422
 423 SELECT ts_headline('english',
 424   'Search terms may occur
 425 many times in a document,
 426 requiring ranking of the search matches to decide which
 427 occurrences to display in the result.',
 428   to_tsquery('english', 'search & term'),
 429   'MaxFragments=10, MaxWords=7, MinWords=3, StartSel=<<, StopSel=>>');
 430                         ts_headline
 431 ------------------------------------------------------------
 432  <<Search>> <<terms>> may occur                            +
 433  many times ... ranking of the <<search>> matches to decide
 434
 435    ts_headline uses the original document, not a tsvector summary, so it
 436    can be slow and should be used with care.