<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>12.3. Controlling Text Search</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="textsearch-tables.html" title="12.2. Tables and Indexes" /><link rel="next" href="textsearch-features.html" title="12.4. Additional Features" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">12.3. Controlling Text Search</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="textsearch-tables.html" title="12.2. Tables and Indexes">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><th width="60%" align="center">Chapter 12. Full Text Search</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="textsearch-features.html" title="12.4. Additional Features">Next</a></td></tr></table><hr /></div><div class="sect1" id="TEXTSEARCH-CONTROLS"><div class="titlepage"><div><div><h2 class="title" style="clear: both">12.3. Controlling Text Search <a href="#TEXTSEARCH-CONTROLS" class="id_link">#</a></h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="textsearch-controls.html#TEXTSEARCH-PARSING-DOCUMENTS">12.3.1. Parsing Documents</a></span></dt><dt><span class="sect2"><a href="textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES">12.3.2. Parsing Queries</a></span></dt><dt><span class="sect2"><a href="textsearch-controls.html#TEXTSEARCH-RANKING">12.3.3. Ranking Search Results</a></span></dt><dt><span class="sect2"><a href="textsearch-controls.html#TEXTSEARCH-HEADLINE">12.3.4. Highlighting Results</a></span></dt></dl></div><p>
To implement full text searching there must be a function to create a
<code class="type">tsvector</code> from a document and a <code class="type">tsquery</code> from a
user query. Also, we need to return results in a useful order, so we need
a function that compares documents with respect to their relevance to
the query. It's also important to be able to display the results nicely.
<span class="productname">PostgreSQL</span> provides support for all of these
functions.
</p><div class="sect2" id="TEXTSEARCH-PARSING-DOCUMENTS"><div class="titlepage"><div><div><h3 class="title">12.3.1. Parsing Documents <a href="#TEXTSEARCH-PARSING-DOCUMENTS" class="id_link">#</a></h3></div></div></div><p>
<span class="productname">PostgreSQL</span> provides the
function <code class="function">to_tsvector</code> for converting a document to
the <code class="type">tsvector</code> data type.
</p><a id="id-1.5.11.6.3.3" class="indexterm"></a><pre class="synopsis">
to_tsvector([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>document</code></em> <code class="type">text</code>) returns <code class="type">tsvector</code>
</pre><p>
<code class="function">to_tsvector</code> parses a textual document into tokens,
reduces the tokens to lexemes, and returns a <code class="type">tsvector</code> that
lists the lexemes together with their positions in the document.
The document is processed according to the specified or default
text search configuration.
Here is a simple example:
</p><pre class="screen">
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
                     to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
</pre><p>
In the example above we see that the resulting <code class="type">tsvector</code> does not
contain the words <code class="literal">a</code>, <code class="literal">on</code>, or
<code class="literal">it</code>, the word <code class="literal">rats</code> became
<code class="literal">rat</code>, and the punctuation sign <code class="literal">-</code> was
ignored.
</p><p>
The <code class="function">to_tsvector</code> function internally calls a parser
which breaks the document text into tokens and assigns a type to
each token. For each token, a list of
dictionaries (<a class="xref" href="textsearch-dictionaries.html" title="12.6. Dictionaries">Section 12.6</a>) is consulted,
where the list can vary depending on the token type. The first dictionary
that <em class="firstterm">recognizes</em> the token emits one or more normalized
<em class="firstterm">lexemes</em> to represent the token. For example,
<code class="literal">rats</code> became <code class="literal">rat</code> because one of the
dictionaries recognized that the word <code class="literal">rats</code> is a plural
form of <code class="literal">rat</code>. Some words are recognized as
<em class="firstterm">stop words</em> (<a class="xref" href="textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS" title="12.6.1. Stop Words">Section 12.6.1</a>), which
causes them to be ignored since they occur too frequently to be useful in
searching. In our example these are
<code class="literal">a</code>, <code class="literal">on</code>, and <code class="literal">it</code>.
If no dictionary in the list recognizes the token then it is also ignored.
In this example that happened to the punctuation sign <code class="literal">-</code>
because there are in fact no dictionaries assigned for its token type
(<code class="literal">Space symbols</code>), meaning space tokens will never be
indexed. The choices of parser, dictionaries and which types of tokens to
index are determined by the selected text search configuration (<a class="xref" href="textsearch-configuration.html" title="12.7. Configuration Example">Section 12.7</a>). It is possible to have
many different configurations in the same database, and predefined
configurations are available for various languages. In our example
we used the default configuration <code class="literal">english</code> for the
example.
</p><p>
The function <code class="function">setweight</code> can be used to label the
entries of a <code class="type">tsvector</code> with a given <em class="firstterm">weight</em>,
where a weight is one of the letters <code class="literal">A</code>, <code class="literal">B</code>,
<code class="literal">C</code>, or <code class="literal">D</code>.
This is typically used to mark entries coming from
different parts of a document, such as title versus body. Later, this
information can be used for ranking of search results.
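</p><p>
As a minimal sketch of the labeling just described, the following query
attaches weight <code class="literal">A</code> to every lexeme of a short
string. The input text is made up for illustration, and the
<code class="literal">english</code> configuration is assumed to be installed.
</p><pre class="programlisting">
-- Label all lexemes of an illustrative title string with weight A.
SELECT setweight(to_tsvector('english', 'The Fat Rats'), 'A');
</pre><p>
Each surviving lexeme in the result should carry an
<code class="literal">A</code> after its position, which is the label the
ranking functions can later take into account.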
</p><p>
Because <code class="function">to_tsvector</code>(<code class="literal">NULL</code>) will
return <code class="literal">NULL</code>, it is recommended to use
<code class="function">coalesce</code> whenever a field might be null.
Here is the recommended method for creating
a <code class="type">tsvector</code> from a structured document:
</p><pre class="programlisting">
UPDATE tt SET ti =
    setweight(to_tsvector(coalesce(title,'')), 'A')    ||
    setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
    setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
    setweight(to_tsvector(coalesce(body,'')), 'D');
</pre><p>
Here we have used <code class="function">setweight</code> to label the source
of each lexeme in the finished <code class="type">tsvector</code>, and then merged
the labeled <code class="type">tsvector</code> values using the <code class="type">tsvector</code>
concatenation operator <code class="literal">||</code>. (<a class="xref" href="textsearch-features.html#TEXTSEARCH-MANIPULATE-TSVECTOR" title="12.4.1. Manipulating Documents">Section 12.4.1</a> gives details about these
operations.)
</p></div><div class="sect2" id="TEXTSEARCH-PARSING-QUERIES"><div class="titlepage"><div><div><h3 class="title">12.3.2. Parsing Queries <a href="#TEXTSEARCH-PARSING-QUERIES" class="id_link">#</a></h3></div></div></div><p>
<span class="productname">PostgreSQL</span> provides the
functions <code class="function">to_tsquery</code>,
<code class="function">plainto_tsquery</code>,
<code class="function">phraseto_tsquery</code> and
<code class="function">websearch_to_tsquery</code>
for converting a query to the <code class="type">tsquery</code> data type.
<code class="function">to_tsquery</code> offers access to more features
than either <code class="function">plainto_tsquery</code> or
<code class="function">phraseto_tsquery</code>, but it is less forgiving about its
input. <code class="function">websearch_to_tsquery</code> is a simplified version
of <code class="function">to_tsquery</code> with an alternative syntax, similar
to the one used by web search engines.
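</p><p>
To make the differences concrete, the following sketch feeds the same
made-up, unformatted input to the three functions that accept plain text,
so that their results can be compared side by side. The column aliases are
illustrative only.
</p><pre class="programlisting">
-- Same input, three different query constructors.
SELECT plainto_tsquery('english', 'fat rats')      AS plain_query,
       phraseto_tsquery('english', 'fat rats')     AS phrase_query,
       websearch_to_tsquery('english', 'fat rats') AS web_query;
</pre><p>
<code class="function">to_tsquery</code> is left out of this comparison
because it requires an operator between the two words and would reject
this input with a syntax error.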
</p><a id="id-1.5.11.6.4.3" class="indexterm"></a><pre class="synopsis">
to_tsquery([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>querytext</code></em> <code class="type">text</code>) returns <code class="type">tsquery</code>
</pre><p>
<code class="function">to_tsquery</code> creates a <code class="type">tsquery</code> value from
<em class="replaceable"><code>querytext</code></em>, which must consist of single tokens
separated by the <code class="type">tsquery</code> operators <code class="literal">&amp;</code> (AND),
<code class="literal">|</code> (OR), <code class="literal">!</code> (NOT), and
<code class="literal">&lt;-&gt;</code> (FOLLOWED BY), possibly grouped
using parentheses. In other words, the input to
<code class="function">to_tsquery</code> must already follow the general rules for
<code class="type">tsquery</code> input, as described in <a class="xref" href="datatype-textsearch.html#DATATYPE-TSQUERY" title="8.11.2. tsquery">Section 8.11.2</a>. The difference is that while basic
<code class="type">tsquery</code> input takes the tokens at face value,
<code class="function">to_tsquery</code> normalizes each token into a lexeme using
the specified or default configuration, and discards any tokens that are
stop words according to the configuration. For example:
</p><pre class="screen">
SELECT to_tsquery('english', 'The &amp; Fat &amp; Rats');
  to_tsquery
---------------
 'fat' &amp; 'rat'
</pre><p>
As in basic <code class="type">tsquery</code> input, weight(s) can be attached to each
lexeme to restrict it to match only <code class="type">tsvector</code> lexemes of those
weight(s). For example:
</p><pre class="screen">
SELECT to_tsquery('english', 'Fat | Rats:AB');
    to_tsquery
------------------
 'fat' | 'rat':AB
</pre><p>
Also, <code class="literal">*</code> can be attached to a lexeme to specify prefix matching:
</p><pre class="screen">
SELECT to_tsquery('supern:*A &amp; star:A*B');
        to_tsquery
--------------------------
 'supern':*A &amp; 'star':*AB
</pre><p>
Such a lexeme will match any word in a <code class="type">tsvector</code> that begins
with the given string.
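</p><p>
The effect of a prefix label on matching can be sketched as follows. The
<code class="literal">simple</code> configuration is used here, as an
assumption for the sake of the example, so that no stemming interferes with
the comparison; the input text and column aliases are illustrative.
</p><pre class="programlisting">
-- 'supern:*' matches any lexeme starting with "supern";
-- without the prefix label, only the exact lexeme "supern" would match.
SELECT to_tsvector('simple', 'supernovae stars') @@ to_tsquery('simple', 'supern:*') AS prefix_match,
       to_tsvector('simple', 'supernovae stars') @@ to_tsquery('simple', 'supern')   AS exact_match;
</pre><p>
The first column is expected to be true and the second false, since the
document contains <code class="literal">supernovae</code> but not the bare
word <code class="literal">supern</code>.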
</p><p>
<code class="function">to_tsquery</code> can also accept single-quoted
phrases. This is primarily useful when the configuration includes a
thesaurus dictionary that may trigger on such phrases.
In the example below, a thesaurus contains the rule <code class="literal">supernovae
stars : sn</code>:
</p><pre class="screen">
SELECT to_tsquery('''supernovae stars'' &amp; !crab');
   to_tsquery
----------------
 'sn' &amp; !'crab'
</pre><p>
Without quotes, <code class="function">to_tsquery</code> will generate a syntax
error for tokens that are not separated by an AND, OR, or FOLLOWED BY
operator.
</p><a id="id-1.5.11.6.4.7" class="indexterm"></a><pre class="synopsis">
plainto_tsquery([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>querytext</code></em> <code class="type">text</code>) returns <code class="type">tsquery</code>
</pre><p>
<code class="function">plainto_tsquery</code> transforms the unformatted text
<em class="replaceable"><code>querytext</code></em> to a <code class="type">tsquery</code> value.
The text is parsed and normalized much as for <code class="function">to_tsvector</code>,
then the <code class="literal">&amp;</code> (AND) <code class="type">tsquery</code> operator is
inserted between surviving words.
</p><p>
Example:
</p><pre class="screen">
SELECT plainto_tsquery('english', 'The Fat Rats');
 plainto_tsquery
-----------------
 'fat' &amp; 'rat'
</pre><p>
Note that <code class="function">plainto_tsquery</code> will not
recognize <code class="type">tsquery</code> operators, weight labels,
or prefix-match labels in its input:
</p><pre class="screen">
SELECT plainto_tsquery('english', 'The Fat &amp; Rats:C');
   plainto_tsquery
---------------------
 'fat' &amp; 'rat' &amp; 'c'
</pre><p>
Here, all the input punctuation was discarded.
</p><a id="id-1.5.11.6.4.11" class="indexterm"></a><pre class="synopsis">
phraseto_tsquery([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>querytext</code></em> <code class="type">text</code>) returns <code class="type">tsquery</code>
</pre><p>
<code class="function">phraseto_tsquery</code> behaves much like
<code class="function">plainto_tsquery</code>, except that it inserts
the <code class="literal">&lt;-&gt;</code> (FOLLOWED BY) operator between
surviving words instead of the <code class="literal">&amp;</code> (AND) operator.
Also, stop words are not simply discarded, but are accounted for by
inserting <code class="literal">&lt;<em class="replaceable"><code>N</code></em>&gt;</code> operators rather
than <code class="literal">&lt;-&gt;</code> operators. This function is useful
when searching for exact lexeme sequences, since the FOLLOWED BY
operators check lexeme order, not just the presence of all the lexemes.
</p><p>
Example:
</p><pre class="screen">
SELECT phraseto_tsquery('english', 'The Fat Rats');
 phraseto_tsquery
------------------
 'fat' &lt;-&gt; 'rat'
</pre><p>
Like <code class="function">plainto_tsquery</code>, the
<code class="function">phraseto_tsquery</code> function will not
recognize <code class="type">tsquery</code> operators, weight labels,
or prefix-match labels in its input:
</p><pre class="screen">
SELECT phraseto_tsquery('english', 'The Fat &amp; Rats:C');
      phraseto_tsquery
-----------------------------
 'fat' &lt;-&gt; 'rat' &lt;-&gt; 'c'
</pre><p>
</p><pre class="synopsis">
websearch_to_tsquery([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>querytext</code></em> <code class="type">text</code>) returns <code class="type">tsquery</code>
</pre><p>
<code class="function">websearch_to_tsquery</code> creates a <code class="type">tsquery</code>
value from <em class="replaceable"><code>querytext</code></em> using an alternative
syntax in which simple unformatted text is a valid query.
Unlike <code class="function">plainto_tsquery</code>
and <code class="function">phraseto_tsquery</code>, it also recognizes certain
operators. Moreover, this function will never raise syntax errors,
which makes it possible to use raw user-supplied input for search.
The following syntax is supported:
</p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
<code class="literal">unquoted text</code>: text not inside quote marks will be
converted to terms separated by <code class="literal">&amp;</code> operators, as
if processed by <code class="function">plainto_tsquery</code>.
</p></li><li class="listitem" style="list-style-type: disc"><p>
<code class="literal">"quoted text"</code>: text inside quote marks will be
converted to terms separated by <code class="literal">&lt;-&gt;</code>
operators, as if processed by <code class="function">phraseto_tsquery</code>.
</p></li><li class="listitem" style="list-style-type: disc"><p>
<code class="literal">OR</code>: the word <span class="quote">“<span class="quote">or</span>”</span> will be converted to
the <code class="literal">|</code> operator.
</p></li><li class="listitem" style="list-style-type: disc"><p>
<code class="literal">-</code>: a dash will be converted to
the <code class="literal">!</code> operator.
</p></li></ul></div><p>
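The rules listed above can be combined in a single query string. The
following sketch uses a made-up input that exercises a quoted phrase, the
word <span class="quote">“<span class="quote">or</span>”</span>, and a leading
dash in one call; it is meant only as an illustration of the syntax.
</p><pre class="programlisting">
-- A quoted phrase, an alternative term, and an excluded term.
SELECT websearch_to_tsquery('english', '"fat rat" or cat -dog');
</pre><p>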
Other punctuation is ignored. So
like <code class="function">plainto_tsquery</code>
and <code class="function">phraseto_tsquery</code>,
the <code class="function">websearch_to_tsquery</code> function will not
recognize <code class="type">tsquery</code> operators, weight labels, or prefix-match
labels in its input.
</p><p>
Examples:
</p><pre class="screen">
SELECT websearch_to_tsquery('english', 'The fat rats');
 websearch_to_tsquery
----------------------
 'fat' &amp; 'rat'
(1 row)

SELECT websearch_to_tsquery('english', '"supernovae stars" -crab');
       websearch_to_tsquery
----------------------------------
 'supernova' &lt;-&gt; 'star' &amp; !'crab'
(1 row)

SELECT websearch_to_tsquery('english', '"sad cat" or "fat rat"');
       websearch_to_tsquery
-----------------------------------
 'sad' &lt;-&gt; 'cat' | 'fat' &lt;-&gt; 'rat'
(1 row)

SELECT websearch_to_tsquery('english', 'signal -"segmentation fault"');
         websearch_to_tsquery
---------------------------------------
 'signal' &amp; !( 'segment' &lt;-&gt; 'fault' )
(1 row)

SELECT websearch_to_tsquery('english', '""" )( dummy \\ query &lt;-&gt;');
 websearch_to_tsquery
----------------------
 'dummi' &amp; 'queri'
(1 row)
</pre><p>
</p></div><div class="sect2" id="TEXTSEARCH-RANKING"><div class="titlepage"><div><div><h3 class="title">12.3.3. Ranking Search Results <a href="#TEXTSEARCH-RANKING" class="id_link">#</a></h3></div></div></div><p>
Ranking attempts to measure how relevant documents are to a particular
query, so that when there are many matches the most relevant ones can be
shown first. <span class="productname">PostgreSQL</span> provides two
predefined ranking functions, which take into account lexical, proximity,
and structural information; that is, they consider how often the query
terms appear in the document, how close together the terms are in the
document, and how important the part of the document where they occur is.
However, the concept of relevancy is vague and very application-specific.
Different applications might require additional information for ranking,
e.g., document modification time. The built-in ranking functions are only
examples. You can write your own ranking functions and/or combine their
results with additional factors to fit your specific needs.
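</p><p>
One way to combine a built-in rank with an additional factor is sketched
below. Everything specific in it is an assumption made for illustration:
the inline <code class="literal">docs</code> data, its
<code class="literal">modified</code> column, and the particular decay
formula, which divides the rank by one plus the document's age in years.
It is not a recommended scoring scheme, only a demonstration that the
result of <code class="function">ts_rank</code> is an ordinary number that
can take part in further arithmetic.
</p><pre class="programlisting">
-- Hypothetical data: two documents with identical text but different ages.
WITH docs(title, body, modified) AS (
    VALUES ('old note', 'dark matter survey', now() - interval '400 days'),
           ('new note', 'dark matter survey', now() - interval '2 days')
)
SELECT title,
       -- text relevance, damped by age in years (illustrative formula)
       ts_rank(to_tsvector('english', body), query)
         / (1 + extract(epoch FROM now() - modified) / 86400.0 / 365.0) AS score
FROM docs, to_tsquery('english', 'dark &amp; matter') AS query
WHERE to_tsvector('english', body) @@ query
ORDER BY score DESC;
</pre><p>
Because both rows have the same text, their text rank is identical and the
more recently modified row should sort first.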
</p><p>
The two ranking functions currently available are:
</p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
<a id="id-1.5.11.6.5.3.1.1.1.1" class="indexterm"></a>
<code class="literal">ts_rank([<span class="optional"> <em class="replaceable"><code>weights</code></em> <code class="type">float4[]</code>, </span>] <em class="replaceable"><code>vector</code></em> <code class="type">tsvector</code>, <em class="replaceable"><code>query</code></em> <code class="type">tsquery</code> [<span class="optional">, <em class="replaceable"><code>normalization</code></em> <code class="type">integer</code> </span>]) returns <code class="type">float4</code></code>
</span></dt><dd><p>
Ranks vectors based on the frequency of their matching lexemes.
</p></dd><dt><span class="term">
<a id="id-1.5.11.6.5.3.1.2.1.1" class="indexterm"></a>
<code class="literal">ts_rank_cd([<span class="optional"> <em class="replaceable"><code>weights</code></em> <code class="type">float4[]</code>, </span>] <em class="replaceable"><code>vector</code></em> <code class="type">tsvector</code>, <em class="replaceable"><code>query</code></em> <code class="type">tsquery</code> [<span class="optional">, <em class="replaceable"><code>normalization</code></em> <code class="type">integer</code> </span>]) returns <code class="type">float4</code></code>
</span></dt><dd><p>
This function computes the <em class="firstterm">cover density</em>
ranking for the given document vector and query, as described in
Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three
Term Queries" in the journal "Information Processing and Management",
1999. Cover density is similar to <code class="function">ts_rank</code> ranking
except that the proximity of matching lexemes to each other is
taken into consideration.
</p><p>
This function requires lexeme positional information to perform
its calculation. Therefore, it ignores any <span class="quote">“<span class="quote">stripped</span>”</span>
lexemes in the <code class="type">tsvector</code>. If there are no unstripped
lexemes in the input, the result will be zero. (See <a class="xref" href="textsearch-features.html#TEXTSEARCH-MANIPULATE-TSVECTOR" title="12.4.1. Manipulating Documents">Section 12.4.1</a> for more information
about the <code class="function">strip</code> function and positional information
in <code class="type">tsvector</code>s.)
</p></dd></dl></div><p>
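The two functions described above can be compared on the same input with a
query along the following lines. The document text and the column aliases
are made up for illustration; the third column applies
<code class="function">strip</code> first to show the effect of missing
positional information on <code class="function">ts_rank_cd</code>.
</p><pre class="programlisting">
-- Rank one illustrative document with both functions, then again
-- with ts_rank_cd after the positions have been stripped.
SELECT ts_rank(vec, q)           AS rank,
       ts_rank_cd(vec, q)        AS rank_cd,
       ts_rank_cd(strip(vec), q) AS rank_cd_stripped
FROM to_tsvector('english', 'a fat cat sat on a mat and ate a fat rat') AS vec,
     to_tsquery('english', 'fat &amp; rat') AS q;
</pre><p>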
For both these functions,
the optional <em class="replaceable"><code>weights</code></em>
argument offers the ability to weigh word instances more or less
heavily depending on how they are labeled. The weight arrays specify
how heavily to weigh each category of word, in the order:
</p><pre class="synopsis">
{D-weight, C-weight, B-weight, A-weight}
</pre><p>
If no <em class="replaceable"><code>weights</code></em> are provided,
then these defaults are used:
</p><pre class="programlisting">
{0.1, 0.2, 0.4, 1.0}
</pre><p>
Typically weights are used to mark words from special areas of the
document, like the title or an initial abstract, so they can be
treated with more or less importance than words in the document body.
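</p><p>
A minimal sketch of passing an explicit <em class="replaceable"><code>weights</code></em>
array follows. The document is assembled inline from two made-up strings
labeled <code class="literal">A</code> and <code class="literal">D</code>,
and the second weight array is an arbitrary choice that lowers the
<code class="literal">C</code> and <code class="literal">B</code> weights;
both are assumptions for illustration only.
</p><pre class="programlisting">
-- Weight arrays are given in the order {D, C, B, A}.
SELECT ts_rank('{0.1, 0.2, 0.4, 1.0}'::float4[], vec, q) AS default_weights,
       ts_rank('{0.1, 0.1, 0.1, 1.0}'::float4[], vec, q) AS custom_weights
FROM (SELECT setweight(to_tsvector('english', 'dark matter'), 'A') ||
             setweight(to_tsvector('english', 'matter is mostly dark'), 'D') AS vec) AS d,
     to_tsquery('english', 'dark &amp; matter') AS q;
</pre><p>
The first column passes the default array explicitly, so it should equal
the result of calling <code class="function">ts_rank</code> without a
<em class="replaceable"><code>weights</code></em> argument.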
</p><p>
Since a longer document has a greater chance of containing a query term,
it is reasonable to take into account document size, e.g., a hundred-word
document with five instances of a search word is probably more relevant
than a thousand-word document with five instances. Both ranking functions
take an integer <em class="replaceable"><code>normalization</code></em> option that
specifies whether and how a document's length should impact its rank.
The integer option controls several behaviors, so it is a bit mask:
you can specify one or more behaviors using
<code class="literal">|</code> (for example, <code class="literal">2|4</code>).
</p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
0 (the default) ignores the document length
</p></li><li class="listitem" style="list-style-type: disc"><p>
1 divides the rank by 1 + the logarithm of the document length
</p></li><li class="listitem" style="list-style-type: disc"><p>
2 divides the rank by the document length
</p></li><li class="listitem" style="list-style-type: disc"><p>
4 divides the rank by the mean harmonic distance between extents
(this is implemented only by <code class="function">ts_rank_cd</code>)
</p></li><li class="listitem" style="list-style-type: disc"><p>
8 divides the rank by the number of unique words in the document
</p></li><li class="listitem" style="list-style-type: disc"><p>
16 divides the rank by 1 + the logarithm of the number
of unique words in the document
</p></li><li class="listitem" style="list-style-type: disc"><p>
32 divides the rank by itself + 1
</p></li></ul></div><p>
If more than one flag bit is specified, the transformations are
applied in the order listed.
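</p><p>
The bit-mask usage can be sketched as follows, with a made-up document and
illustrative column aliases. The flags are combined with the integer
<code class="literal">|</code> operator, so <code class="literal">2|4</code>
is simply the integer 6.
</p><pre class="programlisting">
-- Compare the unnormalized rank with two normalization settings.
SELECT ts_rank_cd(vec, q)      AS raw_rank,
       ts_rank_cd(vec, q, 2)   AS by_length,
       ts_rank_cd(vec, q, 2|4) AS by_length_and_distance
FROM to_tsvector('english', 'a fat cat sat on a mat and ate a fat rat') AS vec,
     to_tsquery('english', 'fat &amp; rat') AS q;
</pre><p>
Only the scale of the rank changes between the columns; the query and the
document are the same in all three calls.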
</p><p>
It is important to note that the ranking functions do not use any global
information, so it is impossible to produce a fair normalization to 1% or
100% as sometimes desired. Normalization option 32
(<code class="literal">rank/(rank+1)</code>) can be applied to scale all ranks
into the range zero to one, but of course this is just a cosmetic change;
it will not affect the ordering of the search results.
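</p><p>
The arithmetic behind option 32 can be checked in isolation. The rank
values in this sketch are arbitrary numbers chosen for illustration, not
output of a ranking function.
</p><pre class="programlisting">
-- rank/(rank+1) maps positive ranks into the range zero to one
-- while preserving their relative order.
SELECT r AS rank, r / (r + 1) AS normalized
FROM (VALUES (0.5), (2.0), (10.0)) AS t(r)
ORDER BY r DESC;
</pre><p>
Each normalized value lies between zero and one, and the rows come out in
the same order whether they are sorted by the original or the normalized
column.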
</p><p>
Here is an example that selects only the ten highest-ranked matches:
</p><pre class="screen">
SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |   rank
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218
</pre><p>
This is the same example using normalized ranking:
</p><pre class="screen">
SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |       rank
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481
</pre><p>
Ranking can be expensive since it requires consulting the
<code class="type">tsvector</code> of each matching document, which can be I/O bound and
therefore slow. Unfortunately, it is almost impossible to avoid since
practical queries often result in large numbers of matches.
</p></div><div class="sect2" id="TEXTSEARCH-HEADLINE"><div class="titlepage"><div><div><h3 class="title">12.3.4. Highlighting Results <a href="#TEXTSEARCH-HEADLINE" class="id_link">#</a></h3></div></div></div><p>
To present search results, it is ideal to show a part of each document and
how it is related to the query. Usually, search engines show fragments of
the document with marked search terms. <span class="productname">PostgreSQL</span>
provides a function <code class="function">ts_headline</code> that
implements this functionality.
</p><a id="id-1.5.11.6.6.3" class="indexterm"></a><pre class="synopsis">
ts_headline([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>document</code></em> <code class="type">text</code>, <em class="replaceable"><code>query</code></em> <code class="type">tsquery</code> [<span class="optional">, <em class="replaceable"><code>options</code></em> <code class="type">text</code> </span>]) returns <code class="type">text</code>
</pre><p>
<code class="function">ts_headline</code> accepts a document along
with a query, and returns an excerpt from
the document in which terms from the query are highlighted.
Specifically, the function will use the query to select relevant
text fragments, and then highlight all words that appear in the query,
even if those word positions do not match the query's restrictions. The
configuration to be used to parse the document can be specified by
<em class="replaceable"><code>config</code></em>; if <em class="replaceable"><code>config</code></em>
is omitted, the
<code class="varname">default_text_search_config</code> configuration is used.
</p><p>
If an <em class="replaceable"><code>options</code></em> string is specified, it must
consist of a comma-separated list of one or more
<em class="replaceable"><code>option</code></em><code class="literal">=</code><em class="replaceable"><code>value</code></em> pairs.
The available options are:
</p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
<code class="literal">MaxWords</code>, <code class="literal">MinWords</code> (integers):
these numbers determine the longest and shortest headlines to output.
The default values are 35 and 15.
</p></li><li class="listitem" style="list-style-type: disc"><p>
<code class="literal">ShortWord</code> (integer): words of this length or less
will be dropped at the start and end of a headline, unless they are
query terms. The default value of three eliminates common English
articles.
</p></li><li class="listitem" style="list-style-type: disc"><p>
<code class="literal">HighlightAll</code> (boolean): if
<code class="literal">true</code> the whole document will be used as the
headline, ignoring the preceding three parameters. The default
is <code class="literal">false</code>.
</p></li><li class="listitem" style="list-style-type: disc"><p>
<code class="literal">MaxFragments</code> (integer): maximum number of text
fragments to display. The default value of zero selects a
non-fragment-based headline generation method. A value greater
than zero selects fragment-based headline generation (see below).
</p></li><li class="listitem" style="list-style-type: disc"><p>
<code class="literal">StartSel</code>, <code class="literal">StopSel</code> (strings):
the strings with which to delimit query words appearing in the
document, to distinguish them from other excerpted words. The
default values are <span class="quote">“<span class="quote"><code class="literal">&lt;b&gt;</code></span>”</span> and
<span class="quote">“<span class="quote"><code class="literal">&lt;/b&gt;</code></span>”</span>, which can be suitable
for HTML output (but see the warning below).
</p></li><li class="listitem" style="list-style-type: disc"><p>
<code class="literal">FragmentDelimiter</code> (string): When more than one
fragment is displayed, the fragments will be separated by this string.
The default is <span class="quote">“<span class="quote"><code class="literal"> ... </code></span>”</span>.
</p></li></ul></div><p>
</p><div class="warning"><h3 class="title">Warning: Cross-site Scripting (XSS) Safety</h3><p>
The output from <code class="function">ts_headline</code> is not guaranteed to
be safe for direct inclusion in web pages. When
<code class="literal">HighlightAll</code> is <code class="literal">false</code> (the
default), some simple XML tags are removed from the document, but this
is not guaranteed to remove all HTML markup. Therefore, this does not
provide an effective defense against attacks such as cross-site
scripting (XSS) when working with untrusted input. To guard
against such attacks, all HTML markup should be removed from the input
document, or an HTML sanitizer should be used on the output.
</p></div><p>
These option names are recognized case-insensitively.
You must double-quote string values if they contain spaces or commas.
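</p><p>
Both rules can be seen in a sketch like the following. The document text,
the bracket delimiters, and the lower-case spelling of the option names are
choices made for illustration; the delimiters are double-quoted because
they contain spaces.
</p><pre class="programlisting">
-- Option names in lower case; string values with spaces double-quoted.
SELECT ts_headline('english',
  'Search terms may occur many times in a document',
  to_tsquery('english', 'search &amp; term'),
  'startsel="[[ ", stopsel=" ]]", MaxWords=8, MinWords=3');
</pre><p>
Non-HTML delimiters such as these may also be convenient when the headline
is post-processed by application code before display.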
</p><p>
In non-fragment-based headline
generation, <code class="function">ts_headline</code> locates matches for the
given <em class="replaceable"><code>query</code></em> and chooses a
single one to display, preferring matches that have more query words
within the allowed headline length.
In fragment-based headline generation, <code class="function">ts_headline</code>
locates the query matches and splits each match
into <span class="quote">“<span class="quote">fragments</span>”</span> of no more than <code class="literal">MaxWords</code>
words each, preferring fragments with more query words, and when
possible <span class="quote">“<span class="quote">stretching</span>”</span> fragments to include surrounding
words. The fragment-based mode is thus more useful when the query
matches span large sections of the document, or when it's desirable to
display multiple matches.
In either mode, if no query matches can be identified, then a single
fragment of the first <code class="literal">MinWords</code> words in the document
will be displayed.
</p><p>
For example:
</p><pre class="screen">
SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('english', 'query &amp; similarity'));
                        ts_headline
------------------------------------------------------------
 containing given &lt;b&gt;query&lt;/b&gt; terms                       +
 and return them in order of their &lt;b&gt;similarity&lt;/b&gt; to the+
 &lt;b&gt;query&lt;/b&gt;.

SELECT ts_headline('english',
  'Search terms may occur
many times in a document,
requiring ranking of the search matches to decide which
occurrences to display in the result.',
  to_tsquery('english', 'search &amp; term'),
  'MaxFragments=10, MaxWords=7, MinWords=3, StartSel=&lt;&lt;, StopSel=&gt;&gt;');
                        ts_headline
------------------------------------------------------------
 &lt;&lt;Search&gt;&gt; &lt;&lt;terms&gt;&gt; may occur                            +
 many times ... ranking of the &lt;&lt;search&gt;&gt; matches to decide
</pre><p>
<code class="function">ts_headline</code> uses the original document, not a
<code class="type">tsvector</code> summary, so it can be slow and should be used with
care.
</p></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="textsearch-tables.html" title="12.2. Tables and Indexes">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="textsearch-features.html" title="12.4. Additional Features">Next</a></td></tr><tr><td width="40%" align="left" valign="top">12.2. Tables and Indexes </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 12.4. Additional Features</td></tr></table></div></body></html>