2 8.11. Text Search Types #
7 PostgreSQL provides two data types that are designed to support full
8 text search, which is the activity of searching through a collection of
9 natural-language documents to locate those that best match a query. The
10 tsvector type represents a document in a form optimized for text
11 search; the tsquery type similarly represents a text query. Chapter 12
12 provides a detailed explanation of this facility, and Section 9.13
13 summarizes the related functions and operators.
17 A tsvector value is a sorted list of distinct lexemes, which are words
18 that have been normalized to merge different variants of the same word
19 (see Chapter 12 for details). Sorting and duplicate-elimination are
20 done automatically during input, as shown in this example:
21 SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
23 ----------------------------------------------------
24 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
26 To represent lexemes containing whitespace or punctuation, surround
28 SELECT $$the lexeme ' ' contains spaces$$::tsvector;
30 -------------------------------------------
31 ' ' 'contains' 'lexeme' 'spaces' 'the'
33 (We use dollar-quoted string literals in this example and the next one
34 to avoid the confusion of having to double quote marks within the
35 literals.) Embedded quotes and backslashes must be doubled:
36 SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
38 ------------------------------------------------
39 'Joe''s' 'a' 'contains' 'lexeme' 'quote' 'the'
41 Optionally, integer positions can be attached to lexemes:
42 SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::ts
45 -------------------------------------------------------------------------------
46 'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4
48 A position normally indicates the source word's location in the
49 document. Positional information can be used for proximity ranking.
50 Position values can range from 1 to 16383; larger numbers are silently
51 set to 16383. Duplicate positions for the same lexeme are discarded.
53 Lexemes that have positions can further be labeled with a weight, which
54 can be A, B, C, or D. D is the default and hence is not shown on
56 SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
58 ----------------------------
59 'a':1A 'cat':5 'fat':2B,4C
61 Weights are typically used to reflect document structure, for example
62 by marking title words differently from body words. Text search ranking
63 functions can assign different priorities to the different weight
66 It is important to understand that the tsvector type itself does not
67 perform any word normalization; it assumes the words it is given are
68 normalized appropriately for the application. For example,
69 SELECT 'The Fat Rats'::tsvector;
74 For most English-text-searching applications the above words would be
75 considered non-normalized, but tsvector doesn't care. Raw document text
76 should usually be passed through to_tsvector to normalize the words
77 appropriately for searching:
78 SELECT to_tsvector('english', 'The Fat Rats');
83 Again, see Chapter 12 for more detail.
87 A tsquery value stores lexemes that are to be searched for, and can
88 combine them using the Boolean operators & (AND), | (OR), and ! (NOT),
89 as well as the phrase search operator <-> (FOLLOWED BY). There is also
90 a variant <N> of the FOLLOWED BY operator, where N is an integer
91 constant that specifies the distance between the two lexemes being
92 searched for. <-> is equivalent to <1>.
94 Parentheses can be used to enforce grouping of these operators. In the
95 absence of parentheses, ! (NOT) binds most tightly, <-> (FOLLOWED BY)
96 next most tightly, then & (AND), with | (OR) binding the least tightly.
98 Here are some examples:
99 SELECT 'fat & rat'::tsquery;
104 SELECT 'fat & (rat | cat)'::tsquery;
106 ---------------------------
107 'fat' & ( 'rat' | 'cat' )
109 SELECT 'fat & rat & ! cat'::tsquery;
111 ------------------------
112 'fat' & 'rat' & !'cat'
114 Optionally, lexemes in a tsquery can be labeled with one or more weight
115 letters, which restricts them to match only tsvector lexemes with one
117 SELECT 'fat:ab & cat'::tsquery;
122 Also, lexemes in a tsquery can be labeled with * to specify prefix
124 SELECT 'super:*'::tsquery;
129 This query will match any word in a tsvector that begins with “super”.
131 Quoting rules for lexemes are the same as described previously for
132 lexemes in tsvector; and, as with tsvector, any required normalization
133 of words must be done before converting to the tsquery type. The
134 to_tsquery function is convenient for performing such normalization:
135 SELECT to_tsquery('Fat:ab & Cats');
140 Note that to_tsquery will process prefixes in the same way as other
141 words, which means this comparison returns true:
142 SELECT to_tsvector( 'postgraduate' ) @@ to_tsquery( 'postgres:*' );
147 because postgres gets stemmed to postgr:
148 SELECT to_tsvector( 'postgraduate' ), to_tsquery( 'postgres:*' );
149 to_tsvector | to_tsquery
150 ---------------+------------
151 'postgradu':1 | 'postgr':*
153 which will match the stemmed form of postgraduate.