
   1 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
   2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>12.3. Controlling Text Search</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="textsearch-tables.html" title="12.2. Tables and Indexes" /><link rel="next" href="textsearch-features.html" title="12.4. Additional Features" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">12.3. Controlling Text Search</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="textsearch-tables.html" title="12.2. Tables and Indexes">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><th width="60%" align="center">Chapter 12. Full Text Search</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="textsearch-features.html" title="12.4. Additional Features">Next</a></td></tr></table><hr /></div><div class="sect1" id="TEXTSEARCH-CONTROLS"><div class="titlepage"><div><div><h2 class="title" style="clear: both">12.3. Controlling Text Search <a href="#TEXTSEARCH-CONTROLS" class="id_link">#</a></h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="textsearch-controls.html#TEXTSEARCH-PARSING-DOCUMENTS">12.3.1. Parsing Documents</a></span></dt><dt><span class="sect2"><a href="textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES">12.3.2. 
Parsing Queries</a></span></dt><dt><span class="sect2"><a href="textsearch-controls.html#TEXTSEARCH-RANKING">12.3.3. Ranking Search Results</a></span></dt><dt><span class="sect2"><a href="textsearch-controls.html#TEXTSEARCH-HEADLINE">12.3.4. Highlighting Results</a></span></dt></dl></div><p>
   3    To implement full text searching there must be a function to create a
   4    <code class="type">tsvector</code> from a document and a <code class="type">tsquery</code> from a
   5    user query. Also, we need to return results in a useful order, so we need
   6    a function that compares documents with respect to their relevance to
   7    the query. It's also important to be able to display the results nicely.
   8    <span class="productname">PostgreSQL</span> provides support for all of these
   9    functions.
  10   </p><div class="sect2" id="TEXTSEARCH-PARSING-DOCUMENTS"><div class="titlepage"><div><div><h3 class="title">12.3.1. Parsing Documents <a href="#TEXTSEARCH-PARSING-DOCUMENTS" class="id_link">#</a></h3></div></div></div><p>
  11     <span class="productname">PostgreSQL</span> provides the
  12     function <code class="function">to_tsvector</code> for converting a document to
  13     the <code class="type">tsvector</code> data type.
  14    </p><a id="id-1.5.11.6.3.3" class="indexterm"></a><pre class="synopsis">
  15 to_tsvector([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>document</code></em> <code class="type">text</code>) returns <code class="type">tsvector</code>
  16 </pre><p>
  17     <code class="function">to_tsvector</code> parses a textual document into tokens,
  18     reduces the tokens to lexemes, and returns a <code class="type">tsvector</code> which
  19     lists the lexemes together with their positions in the document.
  20     The document is processed according to the specified or default
  21     text search configuration.
  22     Here is a simple example:
  23
  24 </p><pre class="screen">
  25 SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
  26                   to_tsvector
  27 -----------------------------------------------------
  28  'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
  29 </pre><p>
  30    </p><p>
  31     In the example above we see that the resulting <code class="type">tsvector</code> does not
  32     contain the words <code class="literal">a</code>, <code class="literal">on</code>, or
  33     <code class="literal">it</code>, the word <code class="literal">rats</code> became
  34     <code class="literal">rat</code>, and the punctuation sign <code class="literal">-</code> was
  35     ignored.
  36    </p><p>
  37     The <code class="function">to_tsvector</code> function internally calls a parser
  38     which breaks the document text into tokens and assigns a type to
  39     each token.  For each token, a list of
  40     dictionaries (<a class="xref" href="textsearch-dictionaries.html" title="12.6. Dictionaries">Section 12.6</a>) is consulted,
  41     where the list can vary depending on the token type.  The first dictionary
  42     that <em class="firstterm">recognizes</em> the token emits one or more normalized
  43     <em class="firstterm">lexemes</em> to represent the token.  For example,
  44     <code class="literal">rats</code> became <code class="literal">rat</code> because one of the
  45     dictionaries recognized that the word <code class="literal">rats</code> is a plural
  46     form of <code class="literal">rat</code>.  Some words are recognized as
  47     <em class="firstterm">stop words</em> (<a class="xref" href="textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS" title="12.6.1. Stop Words">Section 12.6.1</a>), which
  48     causes them to be ignored since they occur too frequently to be useful in
  49     searching.  In our example these are
  50     <code class="literal">a</code>, <code class="literal">on</code>, and <code class="literal">it</code>.
  51     If no dictionary in the list recognizes the token then it is also ignored.
  52     In this example that happened to the punctuation sign <code class="literal">-</code>
  53     because there are in fact no dictionaries assigned for its token type
  54     (<code class="literal">Space symbols</code>), meaning space tokens will never be
  55     indexed. The choices of parser, dictionaries and which types of tokens to
  56     index are determined by the selected text search configuration (<a class="xref" href="textsearch-configuration.html" title="12.7. Configuration Example">Section 12.7</a>).  It is possible to have
  57     many different configurations in the same database, and predefined
  58     configurations are available for various languages. In our example
  59     we used the default configuration <code class="literal">english</code> for the
  60     English language.
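   </p><p>
    To inspect the steps just described, the built-in
    <code class="function">ts_debug</code> function can be run over the same
    sentence.  The query below is only an illustrative sketch; the choice of
    output columns is ours, not part of the example above:

</p><pre class="programlisting">
-- Show, for each token, its type alias and the lexemes (if any) produced
-- by the dictionaries of the 'english' configuration.
SELECT alias, token, lexemes
FROM ts_debug('english', 'a fat  cat sat on a mat - it ate a fat rats');
</pre><p>

    One would expect stop words such as <code class="literal">a</code> to
    appear with an empty lexeme array, and the <code class="literal">-</code>
    token to appear with no lexemes at all because no dictionary is assigned
    to its token type.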
  61    </p><p>
  62     The function <code class="function">setweight</code> can be used to label the
  63     entries of a <code class="type">tsvector</code> with a given <em class="firstterm">weight</em>,
  64     where a weight is one of the letters <code class="literal">A</code>, <code class="literal">B</code>,
  65     <code class="literal">C</code>, or <code class="literal">D</code>.
  66     This is typically used to mark entries coming from
  67     different parts of a document, such as title versus body.  Later, this
  68     information can be used for ranking of search results.
  69    </p><p>
  70     Because <code class="function">to_tsvector</code>(<code class="literal">NULL</code>) will
  71     return <code class="literal">NULL</code>, it is recommended to use
  72     <code class="function">coalesce</code> whenever a field might be null.
  73     Here is the recommended method for creating
  74     a <code class="type">tsvector</code> from a structured document:
  75
  76 </p><pre class="programlisting">
  77 UPDATE tt SET ti =
  78     setweight(to_tsvector(coalesce(title,'')), 'A')    ||
  79     setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
  80     setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
  81     setweight(to_tsvector(coalesce(body,'')), 'D');
  82 </pre><p>
  83
  84     Here we have used <code class="function">setweight</code> to label the source
  85     of each lexeme in the finished <code class="type">tsvector</code>, and then merged
  86     the labeled <code class="type">tsvector</code> values using the <code class="type">tsvector</code>
  87     concatenation operator <code class="literal">||</code>.  (<a class="xref" href="textsearch-features.html#TEXTSEARCH-MANIPULATE-TSVECTOR" title="12.4.1. Manipulating Documents">Section 12.4.1</a> gives details about these
  88     operations.)
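   </p><p>
    The same labeling and concatenation can be tried without a table.  The
    following sketch uses two literal strings chosen purely for illustration
    in place of the <code class="literal">title</code> and
    <code class="literal">body</code> columns:

</p><pre class="programlisting">
-- Label the first vector 'A' and the second 'D', then merge them.
SELECT setweight(to_tsvector('english', 'The Fat Rats'), 'A') ||
       setweight(to_tsvector('english', 'rats ate the cheese'), 'D');
</pre><p>

    In the result, positions that came from the first string should carry the
    label <code class="literal">A</code>; weight <code class="literal">D</code>
    is the default and is not shown on output.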
  89    </p></div><div class="sect2" id="TEXTSEARCH-PARSING-QUERIES"><div class="titlepage"><div><div><h3 class="title">12.3.2. Parsing Queries <a href="#TEXTSEARCH-PARSING-QUERIES" class="id_link">#</a></h3></div></div></div><p>
  90     <span class="productname">PostgreSQL</span> provides the
  91     functions <code class="function">to_tsquery</code>,
  92     <code class="function">plainto_tsquery</code>,
  93     <code class="function">phraseto_tsquery</code> and
  94     <code class="function">websearch_to_tsquery</code>
  95     for converting a query to the <code class="type">tsquery</code> data type.
  96     <code class="function">to_tsquery</code> offers access to more features
  97     than either <code class="function">plainto_tsquery</code> or
  98     <code class="function">phraseto_tsquery</code>, but it is less forgiving about its
  99     input. <code class="function">websearch_to_tsquery</code> is a simplified version
 100     of <code class="function">to_tsquery</code> with an alternative syntax, similar
 101     to the one used by web search engines.
 102    </p><a id="id-1.5.11.6.4.3" class="indexterm"></a><pre class="synopsis">
 103 to_tsquery([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>querytext</code></em> <code class="type">text</code>) returns <code class="type">tsquery</code>
 104 </pre><p>
 105     <code class="function">to_tsquery</code> creates a <code class="type">tsquery</code> value from
 106     <em class="replaceable"><code>querytext</code></em>, which must consist of single tokens
 107     separated by the <code class="type">tsquery</code> operators <code class="literal">&amp;</code> (AND),
 108     <code class="literal">|</code> (OR), <code class="literal">!</code> (NOT), and
 109     <code class="literal">&lt;-&gt;</code> (FOLLOWED BY), possibly grouped
 110     using parentheses.  In other words, the input to
 111     <code class="function">to_tsquery</code> must already follow the general rules for
 112     <code class="type">tsquery</code> input, as described in <a class="xref" href="datatype-textsearch.html#DATATYPE-TSQUERY" title="8.11.2. tsquery">Section 8.11.2</a>.  The difference is that while basic
 113     <code class="type">tsquery</code> input takes the tokens at face value,
 114     <code class="function">to_tsquery</code> normalizes each token into a lexeme using
 115     the specified or default configuration, and discards any tokens that are
 116     stop words according to the configuration.  For example:
 117
 118 </p><pre class="screen">
 119 SELECT to_tsquery('english', 'The &amp; Fat &amp; Rats');
 120   to_tsquery
 121 ---------------
 122  'fat' &amp; 'rat'
 123 </pre><p>
 124
 125     As in basic <code class="type">tsquery</code> input, weight(s) can be attached to each
 126     lexeme to restrict it to match only <code class="type">tsvector</code> lexemes of those
 127     weight(s).  For example:
 128
 129 </p><pre class="screen">
 130 SELECT to_tsquery('english', 'Fat | Rats:AB');
 131     to_tsquery
 132 ------------------
 133  'fat' | 'rat':AB
 134 </pre><p>
 135
 136     Also, <code class="literal">*</code> can be attached to a lexeme to specify prefix matching:
 137
 138 </p><pre class="screen">
 139 SELECT to_tsquery('supern:*A &amp; star:A*B');
 140         to_tsquery
 141 --------------------------
 142  'supern':*A &amp; 'star':*AB
 143 </pre><p>
 144
 145     Such a lexeme will match any word in a <code class="type">tsvector</code> that begins
 146     with the given string.
 147    </p><p>
 148     <code class="function">to_tsquery</code> can also accept single-quoted
 149     phrases.  This is primarily useful when the configuration includes a
 150     thesaurus dictionary that may trigger on such phrases.
 151     In the example below, a thesaurus contains the rule <code class="literal">supernovae
 152     stars : sn</code>:
 153
 154 </p><pre class="screen">
 155 SELECT to_tsquery('''supernovae stars'' &amp; !crab');
 156   to_tsquery
 157 ---------------
 158  'sn' &amp; !'crab'
 159 </pre><p>
 160
 161     Without quotes, <code class="function">to_tsquery</code> will generate a syntax
 162     error for tokens that are not separated by an AND, OR, or FOLLOWED BY
 163     operator.
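   </p><p>
    As a brief sketch of these rules, the first query below groups operators
    with parentheses, while the second omits the operator between two tokens
    and is therefore expected to fail.  Both input strings are made up for
    illustration:

</p><pre class="programlisting">
-- Valid: tokens are separated by operators and grouped with parentheses.
SELECT to_tsquery('english', 'fat &lt;-&gt; (cats | rats)');

-- Invalid: two tokens with no operator between them; this should raise
-- a syntax error rather than return a tsquery.
SELECT to_tsquery('english', 'fat rats');
</pre><p>

    Input that may contain such unseparated words is better passed to one of
    the more forgiving functions described next.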
 164    </p><a id="id-1.5.11.6.4.7" class="indexterm"></a><pre class="synopsis">
 165 plainto_tsquery([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>querytext</code></em> <code class="type">text</code>) returns <code class="type">tsquery</code>
 166 </pre><p>
 167     <code class="function">plainto_tsquery</code> transforms the unformatted text
 168     <em class="replaceable"><code>querytext</code></em> to a <code class="type">tsquery</code> value.
 169     The text is parsed and normalized much as for <code class="function">to_tsvector</code>,
 170     then the <code class="literal">&amp;</code> (AND) <code class="type">tsquery</code> operator is
 171     inserted between surviving words.
 172    </p><p>
 173     Example:
 174
 175 </p><pre class="screen">
 176 SELECT plainto_tsquery('english', 'The Fat Rats');
 177  plainto_tsquery
 178 -----------------
 179  'fat' &amp; 'rat'
 180 </pre><p>
 181
 182     Note that <code class="function">plainto_tsquery</code> will not
 183     recognize <code class="type">tsquery</code> operators, weight labels,
 184     or prefix-match labels in its input:
 185
 186 </p><pre class="screen">
 187 SELECT plainto_tsquery('english', 'The Fat &amp; Rats:C');
 188    plainto_tsquery
 189 ---------------------
 190  'fat' &amp; 'rat' &amp; 'c'
 191 </pre><p>
 192
 193     Here, all the input punctuation was discarded.
 194    </p><a id="id-1.5.11.6.4.11" class="indexterm"></a><pre class="synopsis">
 195 phraseto_tsquery([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>querytext</code></em> <code class="type">text</code>) returns <code class="type">tsquery</code>
 196 </pre><p>
 197     <code class="function">phraseto_tsquery</code> behaves much like
 198     <code class="function">plainto_tsquery</code>, except that it inserts
 199     the <code class="literal">&lt;-&gt;</code> (FOLLOWED BY) operator between
 200     surviving words instead of the <code class="literal">&amp;</code> (AND) operator.
 201     Also, stop words are not simply discarded, but are accounted for by
 202     inserting <code class="literal">&lt;<em class="replaceable"><code>N</code></em>&gt;</code> operators rather
 203     than <code class="literal">&lt;-&gt;</code> operators.  This function is useful
 204     when searching for exact lexeme sequences, since the FOLLOWED BY
  205     operators check lexeme order, not just the presence of all the lexemes.
 206    </p><p>
 207     Example:
 208
 209 </p><pre class="screen">
 210 SELECT phraseto_tsquery('english', 'The Fat Rats');
 211  phraseto_tsquery
 212 ------------------
 213  'fat' &lt;-&gt; 'rat'
 214 </pre><p>
 215
 216     Like <code class="function">plainto_tsquery</code>, the
 217     <code class="function">phraseto_tsquery</code> function will not
 218     recognize <code class="type">tsquery</code> operators, weight labels,
 219     or prefix-match labels in its input:
 220
 221 </p><pre class="screen">
 222 SELECT phraseto_tsquery('english', 'The Fat &amp; Rats:C');
 223       phraseto_tsquery
 224 -----------------------------
 225  'fat' &lt;-&gt; 'rat' &lt;-&gt; 'c'
 226 </pre><p>
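    To illustrate how stop words are accounted for, consider a phrase with a
    stop word between two surviving words.  This is a minimal sketch with an
    input string chosen for illustration:

</p><pre class="programlisting">
-- 'the' is a stop word in the 'english' configuration, so the gap it
-- leaves between 'ate' and 'rats' is kept as a distance in the operator.
SELECT phraseto_tsquery('english', 'the cats ate the rats');
</pre><p>

    The operator between <code class="literal">ate</code> and
    <code class="literal">rat</code> should then be
    <code class="literal">&lt;2&gt;</code> rather than
    <code class="literal">&lt;-&gt;</code>, so the phrase still matches only
    when exactly one word separates the two lexemes.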
 227    </p><pre class="synopsis">
 228 websearch_to_tsquery([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>querytext</code></em> <code class="type">text</code>) returns <code class="type">tsquery</code>
 229 </pre><p>
 230     <code class="function">websearch_to_tsquery</code> creates a <code class="type">tsquery</code>
 231     value from <em class="replaceable"><code>querytext</code></em> using an alternative
 232     syntax in which simple unformatted text is a valid query.
 233     Unlike <code class="function">plainto_tsquery</code>
 234     and <code class="function">phraseto_tsquery</code>, it also recognizes certain
 235     operators. Moreover, this function will never raise syntax errors,
 236     which makes it possible to use raw user-supplied input for search.
 237     The following syntax is supported:
 238
 239     </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
 240         <code class="literal">unquoted text</code>: text not inside quote marks will be
 241         converted to terms separated by <code class="literal">&amp;</code> operators, as
 242         if processed by <code class="function">plainto_tsquery</code>.
 243       </p></li><li class="listitem" style="list-style-type: disc"><p>
 244         <code class="literal">"quoted text"</code>: text inside quote marks will be
 245         converted to terms separated by <code class="literal">&lt;-&gt;</code>
 246         operators, as if processed by <code class="function">phraseto_tsquery</code>.
 247       </p></li><li class="listitem" style="list-style-type: disc"><p>
 248        <code class="literal">OR</code>: the word <span class="quote">“<span class="quote">or</span>”</span> will be converted to
 249        the <code class="literal">|</code> operator.
 250       </p></li><li class="listitem" style="list-style-type: disc"><p>
 251        <code class="literal">-</code>: a dash will be converted to
 252        the <code class="literal">!</code> operator.
 253       </p></li></ul></div><p>
 254
  255     Other punctuation is ignored.  So,
 256     like <code class="function">plainto_tsquery</code>
 257     and <code class="function">phraseto_tsquery</code>,
 258     the <code class="function">websearch_to_tsquery</code> function will not
 259     recognize <code class="type">tsquery</code> operators, weight labels, or prefix-match
 260     labels in its input.
 261    </p><p>
 262     Examples:
 263 </p><pre class="screen">
 264 SELECT websearch_to_tsquery('english', 'The fat rats');
 265  websearch_to_tsquery
 266 ----------------------
 267  'fat' &amp; 'rat'
 268 (1 row)
 269
 270 SELECT websearch_to_tsquery('english', '"supernovae stars" -crab');
 271        websearch_to_tsquery
 272 ----------------------------------
 273  'supernova' &lt;-&gt; 'star' &amp; !'crab'
 274 (1 row)
 275
 276 SELECT websearch_to_tsquery('english', '"sad cat" or "fat rat"');
 277        websearch_to_tsquery
 278 -----------------------------------
 279  'sad' &lt;-&gt; 'cat' | 'fat' &lt;-&gt; 'rat'
 280 (1 row)
 281
 282 SELECT websearch_to_tsquery('english', 'signal -"segmentation fault"');
 283          websearch_to_tsquery
 284 ---------------------------------------
 285  'signal' &amp; !( 'segment' &lt;-&gt; 'fault' )
 286 (1 row)
 287
 288 SELECT websearch_to_tsquery('english', '""" )( dummy \\ query &lt;-&gt;');
 289  websearch_to_tsquery
 290 ----------------------
 291  'dummi' &amp; 'queri'
 292 (1 row)
 293 </pre><p>
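    Because the function never raises syntax errors, its result can be used
    directly in a search condition on user input.  The sketch below assumes
    a table <code class="literal">apod</code> with a
    <code class="type">tsvector</code> column
    <code class="literal">textsearch</code>, as in the ranking examples of this
    section; the search string stands in for whatever a user might type:

</p><pre class="programlisting">
-- Quoted text becomes a phrase, "or" becomes |, and the dash becomes !.
SELECT title
FROM apod
WHERE textsearch @@ websearch_to_tsquery('english', '"dark matter" or neutrino -sun');
</pre><p>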
 294     </p></div><div class="sect2" id="TEXTSEARCH-RANKING"><div class="titlepage"><div><div><h3 class="title">12.3.3. Ranking Search Results <a href="#TEXTSEARCH-RANKING" class="id_link">#</a></h3></div></div></div><p>
 295     Ranking attempts to measure how relevant documents are to a particular
 296     query, so that when there are many matches the most relevant ones can be
 297     shown first.  <span class="productname">PostgreSQL</span> provides two
 298     predefined ranking functions, which take into account lexical, proximity,
 299     and structural information; that is, they consider how often the query
 300     terms appear in the document, how close together the terms are in the
  301     document, and the importance of the part of the document where they occur.
 302     However, the concept of relevancy is vague and very application-specific.
 303     Different applications might require additional information for ranking,
 304     e.g., document modification time.  The built-in ranking functions are only
 305     examples.  You can write your own ranking functions and/or combine their
 306     results with additional factors to fit your specific needs.
 307    </p><p>
 308     The two ranking functions currently available are:
 309
 310     </p><div class="variablelist"><dl class="variablelist"><dt><span class="term">
 311        <a id="id-1.5.11.6.5.3.1.1.1.1" class="indexterm"></a>
 312
 313        <code class="literal">ts_rank([<span class="optional"> <em class="replaceable"><code>weights</code></em> <code class="type">float4[]</code>, </span>] <em class="replaceable"><code>vector</code></em> <code class="type">tsvector</code>, <em class="replaceable"><code>query</code></em> <code class="type">tsquery</code> [<span class="optional">, <em class="replaceable"><code>normalization</code></em> <code class="type">integer</code> </span>]) returns <code class="type">float4</code></code>
 314       </span></dt><dd><p>
 315         Ranks vectors based on the frequency of their matching lexemes.
 316        </p></dd><dt><span class="term">
 317       <a id="id-1.5.11.6.5.3.1.2.1.1" class="indexterm"></a>
 318
 319        <code class="literal">ts_rank_cd([<span class="optional"> <em class="replaceable"><code>weights</code></em> <code class="type">float4[]</code>, </span>] <em class="replaceable"><code>vector</code></em> <code class="type">tsvector</code>, <em class="replaceable"><code>query</code></em> <code class="type">tsquery</code> [<span class="optional">, <em class="replaceable"><code>normalization</code></em> <code class="type">integer</code> </span>]) returns <code class="type">float4</code></code>
 320       </span></dt><dd><p>
 321         This function computes the <em class="firstterm">cover density</em>
 322         ranking for the given document vector and query, as described in
 323         Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three
 324         Term Queries" in the journal "Information Processing and Management",
 325         1999.  Cover density is similar to <code class="function">ts_rank</code> ranking
 326         except that the proximity of matching lexemes to each other is
 327         taken into consideration.
 328        </p><p>
 329         This function requires lexeme positional information to perform
 330         its calculation.  Therefore, it ignores any <span class="quote">“<span class="quote">stripped</span>”</span>
 331         lexemes in the <code class="type">tsvector</code>.  If there are no unstripped
 332         lexemes in the input, the result will be zero.  (See <a class="xref" href="textsearch-features.html#TEXTSEARCH-MANIPULATE-TSVECTOR" title="12.4.1. Manipulating Documents">Section 12.4.1</a> for more information
 333         about the <code class="function">strip</code> function and positional information
 334         in <code class="type">tsvector</code>s.)
 335        </p></dd></dl></div><p>
 336
 337    </p><p>
 338     For both these functions,
 339     the optional <em class="replaceable"><code>weights</code></em>
 340     argument offers the ability to weigh word instances more or less
 341     heavily depending on how they are labeled.  The weight arrays specify
 342     how heavily to weigh each category of word, in the order:
 343
 344 </p><pre class="synopsis">
 345 {D-weight, C-weight, B-weight, A-weight}
 346 </pre><p>
 347
 348     If no <em class="replaceable"><code>weights</code></em> are provided,
 349     then these defaults are used:
 350
 351 </p><pre class="programlisting">
 352 {0.1, 0.2, 0.4, 1.0}
 353 </pre><p>
 354
 355     Typically weights are used to mark words from special areas of the
 356     document, like the title or an initial abstract, so they can be
 357     treated with more or less importance than words in the document body.
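   </p><p>
    As a sketch of how the <em class="replaceable"><code>weights</code></em>
    argument might be used, the query below ranks the same matches twice: once
    with the default weights spelled out, and once counting only lexemes
    labeled <code class="literal">A</code>.  It assumes that the
    <code class="literal">textsearch</code> column of the
    <code class="literal">apod</code> table was built with
    <code class="function">setweight</code>; the column aliases are
    illustrative:

</p><pre class="programlisting">
-- Array order is {D-weight, C-weight, B-weight, A-weight}.
SELECT title,
       ts_rank_cd('{0.1, 0.2, 0.4, 1.0}'::float4[], textsearch, query) AS rank_default,
       ts_rank_cd('{0.0, 0.0, 0.0, 1.0}'::float4[], textsearch, query) AS rank_a_only
FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
WHERE query @@ textsearch
ORDER BY rank_a_only DESC
LIMIT 10;
</pre><p>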
 358    </p><p>
  359     Since a longer document has a greater chance of containing a query term,
 360     it is reasonable to take into account document size, e.g., a hundred-word
 361     document with five instances of a search word is probably more relevant
 362     than a thousand-word document with five instances.  Both ranking functions
 363     take an integer <em class="replaceable"><code>normalization</code></em> option that
 364     specifies whether and how a document's length should impact its rank.
 365     The integer option controls several behaviors, so it is a bit mask:
 366     you can specify one or more behaviors using
 367     <code class="literal">|</code> (for example, <code class="literal">2|4</code>).
 368
 369     </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
 370        0 (the default) ignores the document length
 371       </p></li><li class="listitem" style="list-style-type: disc"><p>
 372        1 divides the rank by 1 + the logarithm of the document length
 373       </p></li><li class="listitem" style="list-style-type: disc"><p>
 374        2 divides the rank by the document length
 375       </p></li><li class="listitem" style="list-style-type: disc"><p>
 376        4 divides the rank by the mean harmonic distance between extents
 377        (this is implemented only by <code class="function">ts_rank_cd</code>)
 378       </p></li><li class="listitem" style="list-style-type: disc"><p>
  379        8 divides the rank by the number of unique words in the document
 380       </p></li><li class="listitem" style="list-style-type: disc"><p>
 381        16 divides the rank by 1 + the logarithm of the number
  382        of unique words in the document
 383       </p></li><li class="listitem" style="list-style-type: disc"><p>
 384        32 divides the rank by itself + 1
 385       </p></li></ul></div><p>
 386
 387     If more than one flag bit is specified, the transformations are
 388     applied in the order listed.
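   </p><p>
    For instance, flags can be combined with the integer bitwise OR operator
    in the function call itself.  The following sketch, which again assumes
    the <code class="literal">apod</code> table used elsewhere in this section,
    first divides the rank by the document length (2) and then scales the
    result with <code class="literal">rank/(rank+1)</code> (32):

</p><pre class="programlisting">
-- 2|32 evaluates to 34: both normalization bits are set.
SELECT title, ts_rank_cd(textsearch, query, 2|32) AS rank
FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
</pre><p>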
 389    </p><p>
 390     It is important to note that the ranking functions do not use any global
 391     information, so it is impossible to produce a fair normalization to 1% or
 392     100% as sometimes desired.  Normalization option 32
 393     (<code class="literal">rank/(rank+1)</code>) can be applied to scale all ranks
 394     into the range zero to one, but of course this is just a cosmetic change;
 395     it will not affect the ordering of the search results.
 396    </p><p>
 397     Here is an example that selects only the ten highest-ranked matches:
 398
 399 </p><pre class="screen">
 400 SELECT title, ts_rank_cd(textsearch, query) AS rank
 401 FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
 402 WHERE query @@ textsearch
 403 ORDER BY rank DESC
 404 LIMIT 10;
 405                      title                     |   rank
 406 -----------------------------------------------+----------
 407  Neutrinos in the Sun                          |      3.1
 408  The Sudbury Neutrino Detector                 |      2.4
 409  A MACHO View of Galactic Dark Matter          |  2.01317
 410  Hot Gas and Dark Matter                       |  1.91171
 411  The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 412  Rafting for Solar Neutrinos                   |      1.9
 413  NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 414  Hot Gas and Dark Matter                       |   1.6123
 415  Ice Fishing for Cosmic Neutrinos              |      1.6
 416  Weak Lensing Distorts the Universe            | 0.818218
 417 </pre><p>
 418
 419     This is the same example using normalized ranking:
 420
 421 </p><pre class="screen">
 422 SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
 423 FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
 424 WHERE  query @@ textsearch
 425 ORDER BY rank DESC
 426 LIMIT 10;
 427                      title                     |        rank
 428 -----------------------------------------------+-------------------
 429  Neutrinos in the Sun                          | 0.756097569485493
 430  The Sudbury Neutrino Detector                 | 0.705882361190954
 431  A MACHO View of Galactic Dark Matter          | 0.668123210574724
 432  Hot Gas and Dark Matter                       |  0.65655958650282
 433  The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 434  Rafting for Solar Neutrinos                   | 0.655172410958162
 435  NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 436  Hot Gas and Dark Matter                       | 0.617195790024749
 437  Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 438  Weak Lensing Distorts the Universe            | 0.450010798361481
 439 </pre><p>
 440    </p><p>
 441     Ranking can be expensive since it requires consulting the
 442     <code class="type">tsvector</code> of each matching document, which can be I/O bound and
 443     therefore slow. Unfortunately, it is almost impossible to avoid since
 444     practical queries often result in large numbers of matches.
 445    </p></div><div class="sect2" id="TEXTSEARCH-HEADLINE"><div class="titlepage"><div><div><h3 class="title">12.3.4. Highlighting Results <a href="#TEXTSEARCH-HEADLINE" class="id_link">#</a></h3></div></div></div><p>
 446     To present search results it is ideal to show a part of each document and
 447     how it is related to the query. Usually, search engines show fragments of
 448     the document with marked search terms.  <span class="productname">PostgreSQL</span>
 449     provides a function <code class="function">ts_headline</code> that
 450     implements this functionality.
 451    </p><a id="id-1.5.11.6.6.3" class="indexterm"></a><pre class="synopsis">
 452 ts_headline([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>document</code></em> <code class="type">text</code>, <em class="replaceable"><code>query</code></em> <code class="type">tsquery</code> [<span class="optional">, <em class="replaceable"><code>options</code></em> <code class="type">text</code> </span>]) returns <code class="type">text</code>
 453 </pre><p>
 454     <code class="function">ts_headline</code> accepts a document along
 455     with a query, and returns an excerpt from
 456     the document in which terms from the query are highlighted.
 457     Specifically, the function will use the query to select relevant
 458     text fragments, and then highlight all words that appear in the query,
 459     even if those word positions do not match the query's restrictions.  The
 460     configuration to be used to parse the document can be specified by
 461     <em class="replaceable"><code>config</code></em>; if <em class="replaceable"><code>config</code></em>
 462     is omitted, the
 463     <code class="varname">default_text_search_config</code> configuration is used.
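   </p><p>
    A minimal call, shown here only as a sketch with a made-up document
    string and no <em class="replaceable"><code>options</code></em>, looks
    like this:

</p><pre class="programlisting">
-- Returns an excerpt of the document with the query terms marked.
SELECT ts_headline('english',
  'The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query.',
  to_tsquery('english', 'query &amp; similarity'));
</pre><p>

    With the default options, the matching words in the returned text are
    wrapped in <code class="literal">&lt;b&gt;</code> and
    <code class="literal">&lt;/b&gt;</code>.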
 464    </p><p>
 465     If an <em class="replaceable"><code>options</code></em> string is specified it must
 466     consist of a comma-separated list of one or more
 467     <em class="replaceable"><code>option</code></em><code class="literal">=</code><em class="replaceable"><code>value</code></em> pairs.
 468     The available options are:
 469
 470     </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
 471        <code class="literal">MaxWords</code>, <code class="literal">MinWords</code> (integers):
 472        these numbers determine the longest and shortest headlines to output.
 473        The default values are 35 for <code class="literal">MaxWords</code> and 15 for <code class="literal">MinWords</code>.
 474       </p></li><li class="listitem" style="list-style-type: disc"><p>
 475        <code class="literal">ShortWord</code> (integer): words of this length or less
 476        will be dropped at the start and end of a headline, unless they are
 477        query terms.  The default value of three eliminates common English
 478        articles.
 479       </p></li><li class="listitem" style="list-style-type: disc"><p>
 480        <code class="literal">HighlightAll</code> (boolean): if
 481        <code class="literal">true</code> the whole document will be used as the
 482        headline, ignoring <code class="literal">MaxWords</code>, <code class="literal">MinWords</code>, and <code class="literal">ShortWord</code>.  The default
 483        is <code class="literal">false</code>.
 484       </p></li><li class="listitem" style="list-style-type: disc"><p>
 485        <code class="literal">MaxFragments</code> (integer): maximum number of text
 486        fragments to display.  The default value of zero selects a
 487        non-fragment-based headline generation method.  A value greater
 488        than zero selects fragment-based headline generation (see below).
 489       </p></li><li class="listitem" style="list-style-type: disc"><p>
 490        <code class="literal">StartSel</code>, <code class="literal">StopSel</code> (strings):
 491        the strings with which to delimit query words appearing in the
 492        document, to distinguish them from other excerpted words.  The
 493        default values are <span class="quote">“<span class="quote"><code class="literal">&lt;b&gt;</code></span>”</span> and
 494        <span class="quote">“<span class="quote"><code class="literal">&lt;/b&gt;</code></span>”</span>, which can be suitable
 495        for HTML output (but see the warning below).
 496       </p></li><li class="listitem" style="list-style-type: disc"><p>
 497        <code class="literal">FragmentDelimiter</code> (string): When more than one
 498        fragment is displayed, the fragments will be separated by this string.
 499        The default is <span class="quote">“<span class="quote"><code class="literal"> ... </code></span>”</span>.
 500       </p></li></ul></div><p>
 501
 502     </p><div class="warning"><h3 class="title">Warning: Cross-site Scripting (XSS) Safety</h3><p>
 503       The output from <code class="function">ts_headline</code> is not guaranteed to
 504       be safe for direct inclusion in web pages. When
 505       <code class="literal">HighlightAll</code> is <code class="literal">false</code> (the
 506       default), some simple XML tags are removed from the document, but this
 507       is not guaranteed to remove all HTML markup. Therefore, this does not
 508       provide an effective defense against attacks such as cross-site
 509       scripting (XSS) when working with untrusted input. To guard
 510       against such attacks, all HTML markup should be removed from the input
 511       document, or an HTML sanitizer should be used on the output.
 512      </p></div><p>
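</p><p>
    One way to act on the warning above is to escape the output rather than
    sanitize it. The following is a minimal sketch, not a vetted defense. It
    assumes the headline will be placed in HTML element content (not in an
    attribute value), and that the placeholder markers
    <code class="literal">[[HL]]</code> and <code class="literal">[[/HL]]</code>, which are
    hypothetical names chosen for this example, do not occur in the documents
    themselves. The escaping could equally be done in the application layer.
</p><pre class="screen">
SELECT replace(replace(
         -- escape the characters that are special in HTML element content
         replace(replace(replace(h, '&amp;', '&amp;amp;'), '&lt;', '&amp;lt;'), '&gt;', '&amp;gt;'),
         -- only then turn the placeholder markers into real tags
         '[[HL]]', '&lt;b&gt;'), '[[/HL]]', '&lt;/b&gt;') AS escaped_headline
FROM (SELECT ts_headline('english',
               'Search terms may occur many times in a document.',
               to_tsquery('english', 'search &amp; term'),
               'StartSel=[[HL]], StopSel=[[/HL]]') AS h) AS s;
</pre><p>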
 513
 514     These option names are recognized case-insensitively.
 515     You must double-quote string values if they contain spaces or commas.
 516    </p><p>
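</p><p>
    As a brief illustration of these two rules, the sketch below spells the
    option names in lower case and double-quotes a
    <code class="literal">StartSel</code> value that contains a space. The
    <code class="literal">em</code> markup and the <code class="literal">hit</code> class
    name are arbitrary choices made for this example, not something
    <code class="function">ts_headline</code> requires:
</p><pre class="screen">
SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('english', 'query &amp; similarity'),
  -- lower-case option names; the value containing a space is double-quoted
  'startsel="&lt;em class=hit&gt;", stopsel=&lt;/em&gt;, maxwords=12, minwords=5');
</pre><p>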
 517     In non-fragment-based headline
 518     generation, <code class="function">ts_headline</code> locates matches for the
 519     given <em class="replaceable"><code>query</code></em> and chooses a
 520     single one to display, preferring matches that have more query words
 521     within the allowed headline length.
 522     In fragment-based headline generation, <code class="function">ts_headline</code>
 523     locates the query matches and splits each match
 524     into <span class="quote">“<span class="quote">fragments</span>”</span> of no more than <code class="literal">MaxWords</code>
 525     words each, preferring fragments with more query words, and when
 526     possible <span class="quote">“<span class="quote">stretching</span>”</span> fragments to include surrounding
 527     words.  The fragment-based mode is thus more useful when the query
 528     matches span large sections of the document, or when it's desirable to
 529     display multiple matches.
 530     In either mode, if no query matches can be identified, then a single
 531     fragment of the first <code class="literal">MinWords</code> words in the document
 532     will be displayed.
 533    </p><p>
 534     For example:
 535
 536 </p><pre class="screen">
 537 SELECT ts_headline('english',
 538   'The most common type of search
 539 is to find all documents containing given query terms
 540 and return them in order of their similarity to the
 541 query.',
 542   to_tsquery('english', 'query &amp; similarity'));
 543                         ts_headline
 544 ------------------------------------------------------------
 545  containing given &lt;b&gt;query&lt;/b&gt; terms                       +
 546  and return them in order of their &lt;b&gt;similarity&lt;/b&gt; to the+
 547  &lt;b&gt;query&lt;/b&gt;.
 548
 549 SELECT ts_headline('english',
 550   'Search terms may occur
 551 many times in a document,
 552 requiring ranking of the search matches to decide which
 553 occurrences to display in the result.',
 554   to_tsquery('english', 'search &amp; term'),
 555   'MaxFragments=10, MaxWords=7, MinWords=3, StartSel=&lt;&lt;, StopSel=&gt;&gt;');
 556                         ts_headline
 557 ------------------------------------------------------------
 558  &lt;&lt;Search&gt;&gt; &lt;&lt;terms&gt;&gt; may occur                            +
 559  many times ... ranking of the &lt;&lt;search&gt;&gt; matches to decide
 560 </pre><p>
 561    </p><p>
 562     <code class="function">ts_headline</code> uses the original document, not a
 563     <code class="type">tsvector</code> summary, so it must re-parse the document
 564     text on every call; it can therefore be slow and should be used with care.
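</p><p>
    A common way to limit that cost, sketched here under stated assumptions,
    is to generate headlines only for the rows that will actually be shown.
    The table <code class="literal">docs</code> and its columns
    <code class="literal">id</code>, <code class="literal">body</code> (the document text) and
    <code class="literal">tsv</code> (a precomputed <code class="type">tsvector</code>) are
    hypothetical names used for illustration:
</p><pre class="screen">
SELECT id, ts_headline('english', body, q) AS headline, rank
FROM (SELECT id, body, q, ts_rank_cd(tsv, q) AS rank
      FROM docs, to_tsquery('english', 'query &amp; similarity') AS q
      WHERE tsv @@ q          -- match against the stored tsvector
      ORDER BY rank DESC
      LIMIT 10) AS top_matches -- only these rows reach ts_headline
ORDER BY rank DESC;
</pre><p>
    Because the <code class="literal">LIMIT</code> is applied inside the subquery,
    <code class="function">ts_headline</code> should be evaluated for at most ten
    documents rather than for every match.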
 565    </p></div></div></body></html>