begriffs open source - ai-pg/blob - full-docs/txt/datatype-textsearch.txt

   1
   2 8.11. Text Search Types #
   3
   4    8.11.1. tsvector
   5    8.11.2. tsquery
   6
   7    PostgreSQL provides two data types that are designed to support full
   8    text search, which is the activity of searching through a collection of
   9    natural-language documents to locate those that best match a query. The
  10    tsvector type represents a document in a form optimized for text
  11    search; the tsquery type similarly represents a text query. Chapter 12
  12    provides a detailed explanation of this facility, and Section 9.13
  13    summarizes the related functions and operators.
  14
  15 8.11.1. tsvector #
  16
  17    A tsvector value is a sorted list of distinct lexemes, which are words
  18    that have been normalized to merge different variants of the same word
  19    (see Chapter 12 for details). Sorting and duplicate-elimination are
  20    done automatically during input, as shown in this example:
  21 SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
  22                       tsvector
  23 ----------------------------------------------------
  24  'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
  25
  26    To represent lexemes containing whitespace or punctuation, surround
  27    them with quotes:
  28 SELECT $$the lexeme '    ' contains spaces$$::tsvector;
  29                  tsvector
  30 -------------------------------------------
  31  '    ' 'contains' 'lexeme' 'spaces' 'the'
  32
  33    (We use dollar-quoted string literals in this example and the next one
  34    to avoid the confusion of having to double quote marks within the
  35    literals.) Embedded quotes and backslashes must be doubled:
  36 SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
  37                     tsvector
  38 ------------------------------------------------
  39  'Joe''s' 'a' 'contains' 'lexeme' 'quote' 'the'
  40
  41    Optionally, integer positions can be attached to lexemes:
  42 SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::ts
  43 vector;
  44                                   tsvector
  45 -------------------------------------------------------------------------------
  46  'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4
  47
  48    A position normally indicates the source word's location in the
  49    document. Positional information can be used for proximity ranking.
  50    Position values can range from 1 to 16383; larger numbers are silently
  51    set to 16383. Duplicate positions for the same lexeme are discarded.
  52
  53    Lexemes that have positions can further be labeled with a weight, which
  54    can be A, B, C, or D. D is the default and hence is not shown on
  55    output:
  56 SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
  57           tsvector
  58 ----------------------------
  59  'a':1A 'cat':5 'fat':2B,4C
  60
  61    Weights are typically used to reflect document structure, for example
  62    by marking title words differently from body words. Text search ranking
  63    functions can assign different priorities to the different weight
  64    markers.
  65
  66    It is important to understand that the tsvector type itself does not
  67    perform any word normalization; it assumes the words it is given are
  68    normalized appropriately for the application. For example,
  69 SELECT 'The Fat Rats'::tsvector;
  70       tsvector
  71 --------------------
  72  'Fat' 'Rats' 'The'
  73
  74    For most English-text-searching applications the above words would be
  75    considered non-normalized, but tsvector doesn't care. Raw document text
  76    should usually be passed through to_tsvector to normalize the words
  77    appropriately for searching:
  78 SELECT to_tsvector('english', 'The Fat Rats');
  79    to_tsvector
  80 -----------------
  81  'fat':2 'rat':3
  82
  83    Again, see Chapter 12 for more detail.
  84
  85 8.11.2. tsquery #
  86
  87    A tsquery value stores lexemes that are to be searched for, and can
  88    combine them using the Boolean operators & (AND), | (OR), and ! (NOT),
  89    as well as the phrase search operator <-> (FOLLOWED BY). There is also
  90    a variant <N> of the FOLLOWED BY operator, where N is an integer
  91    constant that specifies the distance between the two lexemes being
  92    searched for. <-> is equivalent to <1>.
  93
  94    Parentheses can be used to enforce grouping of these operators. In the
  95    absence of parentheses, ! (NOT) binds most tightly, <-> (FOLLOWED BY)
  96    next most tightly, then & (AND), with | (OR) binding the least tightly.
  97
  98    Here are some examples:
  99 SELECT 'fat & rat'::tsquery;
 100     tsquery
 101 ---------------
 102  'fat' & 'rat'
 103
 104 SELECT 'fat & (rat | cat)'::tsquery;
 105           tsquery
 106 ---------------------------
 107  'fat' & ( 'rat' | 'cat' )
 108
 109 SELECT 'fat & rat & ! cat'::tsquery;
 110         tsquery
 111 ------------------------
 112  'fat' & 'rat' & !'cat'
 113
 114    Optionally, lexemes in a tsquery can be labeled with one or more weight
 115    letters, which restricts them to match only tsvector lexemes with one
 116    of those weights:
 117 SELECT 'fat:ab & cat'::tsquery;
 118     tsquery
 119 ------------------
 120  'fat':AB & 'cat'
 121
 122    Also, lexemes in a tsquery can be labeled with * to specify prefix
 123    matching:
 124 SELECT 'super:*'::tsquery;
 125   tsquery
 126 -----------
 127  'super':*
 128
 129    This query will match any word in a tsvector that begins with “super”.
 130
 131    Quoting rules for lexemes are the same as described previously for
 132    lexemes in tsvector; and, as with tsvector, any required normalization
 133    of words must be done before converting to the tsquery type. The
 134    to_tsquery function is convenient for performing such normalization:
 135 SELECT to_tsquery('Fat:ab & Cats');
 136     to_tsquery
 137 ------------------
 138  'fat':AB & 'cat'
 139
 140    Note that to_tsquery will process prefixes in the same way as other
 141    words, which means this comparison returns true:
 142 SELECT to_tsvector( 'postgraduate' ) @@ to_tsquery( 'postgres:*' );
 143  ?column?
 144 ----------
 145  t
 146
 147    because postgres gets stemmed to postgr:
 148 SELECT to_tsvector( 'postgraduate' ), to_tsquery( 'postgres:*' );
 149   to_tsvector  | to_tsquery
 150 ---------------+------------
 151  'postgradu':1 | 'postgr':*
 152
 153    which will match the stemmed form of postgraduate.