begriffs open source - ai-pg/blob - full-docs/txt/pgtrgm.txt

   1
   2 F.35. pg_trgm — support for similarity of text using trigram matching #
   3
   4    F.35.1. Trigram (or Trigraph) Concepts
   5    F.35.2. Functions and Operators
   6    F.35.3. GUC Parameters
   7    F.35.4. Index Support
   8    F.35.5. Text Search Integration
   9    F.35.6. References
  10    F.35.7. Authors
  11
  12    The pg_trgm module provides functions and operators for determining the
  13    similarity of alphanumeric text based on trigram matching, as well as
  14    index operator classes that support fast searching for similar strings.
  15
  16    This module is considered “trusted”, that is, it can be installed by
  17    non-superusers who have CREATE privilege on the current database.
  18
  19 F.35.1. Trigram (or Trigraph) Concepts #
  20
  21    A trigram is a group of three consecutive characters taken from a
  22    string. We can measure the similarity of two strings by counting the
  23    number of trigrams they share. This simple idea turns out to be very
  24    effective for measuring the similarity of words in many natural
  25    languages.
  26
  27 Note
  28
  29    pg_trgm ignores non-word characters (non-alphanumerics) when extracting
  30    trigrams from a string. Each word is considered to have two spaces
  31    prefixed and one space suffixed when determining the set of trigrams
  32    contained in the string. For example, the set of trigrams in the string
  33    “cat” is “ c”, “ ca”, “cat”, and “at ”. The set of trigrams in the
  34    string “foo|bar” is “ f”, “ fo”, “foo”, “oo ”, “ b”, “ ba”, “bar”, and
  35    “ar ”.
  36
  37 F.35.2. Functions and Operators #
  38
  39    The functions provided by the pg_trgm module are shown in Table F.26,
  40    the operators in Table F.27.
  41
  42    Table F.26. pg_trgm Functions
  43
  44    Function
  45
  46    Description
  47
  48    similarity ( text, text ) → real
  49
  50    Returns a number that indicates how similar the two arguments are. The
  51    range of the result is zero (indicating that the two strings are
  52    completely dissimilar) to one (indicating that the two strings are
  53    identical).
  54
  55    show_trgm ( text ) → text[]
  56
  57    Returns an array of all the trigrams in the given string. (In practice
  58    this is seldom useful except for debugging.)
  59
  60    word_similarity ( text, text ) → real
  61
  62    Returns a number that indicates the greatest similarity between the set
  63    of trigrams in the first string and any continuous extent of an ordered
  64    set of trigrams in the second string. For details, see the explanation
  65    below.
  66
  67    strict_word_similarity ( text, text ) → real
  68
  69    Same as word_similarity, but forces extent boundaries to match word
  70    boundaries. Since we don't have cross-word trigrams, this function
  71    actually returns greatest similarity between first string and any
  72    continuous extent of words of the second string.
  73
  74    show_limit () → real
  75
  76    Returns the current similarity threshold used by the % operator. This
  77    sets the minimum similarity between two words for them to be considered
  78    similar enough to be misspellings of each other, for example.
  79    (Deprecated; instead use SHOW pg_trgm.similarity_threshold.)
  80
  81    set_limit ( real ) → real
  82
  83    Sets the current similarity threshold that is used by the % operator.
  84    The threshold must be between 0 and 1 (default is 0.3). Returns the
  85    same value passed in. (Deprecated; instead use SET
  86    pg_trgm.similarity_threshold.)
  87
  88    Consider the following example:
  89 # SELECT word_similarity('word', 'two words');
  90  word_similarity
  91 -----------------
  92              0.8
  93 (1 row)
  94
  95    In the first string, the set of trigrams is {" w"," wo","wor","ord","rd
  96    "}. In the second string, the ordered set of trigrams is {" t","
  97    tw","two","wo "," w"," wo","wor","ord","rds","ds "}. The most similar
  98    extent of an ordered set of trigrams in the second string is {" w","
  99    wo","wor","ord"}, and the similarity is 0.8.
 100
 101    This function returns a value that can be approximately understood as
 102    the greatest similarity between the first string and any substring of
 103    the second string. However, this function does not add padding to the
 104    boundaries of the extent. Thus, the number of additional characters
 105    present in the second string is not considered, except for the
 106    mismatched word boundaries.
 107
 108    At the same time, strict_word_similarity selects an extent of words in
 109    the second string. In the example above, strict_word_similarity would
 110    select the extent of a single word 'words', whose set of trigrams is {"
 111    w"," wo","wor","ord","rds","ds "}.
 112 # SELECT strict_word_similarity('word', 'two words'), similarity('word', 'words'
 113 );
 114  strict_word_similarity | similarity
 115 ------------------------+------------
 116                0.571429 |   0.571429
 117 (1 row)
 118
 119    Thus, the strict_word_similarity function is useful for finding the
 120    similarity to whole words, while word_similarity is more suitable for
 121    finding the similarity for parts of words.
 122
 123    Table F.27. pg_trgm Operators
 124
 125    Operator
 126
 127    Description
 128
 129    text % text → boolean
 130
 131    Returns true if its arguments have a similarity that is greater than
 132    the current similarity threshold set by pg_trgm.similarity_threshold.
 133
 134    text <% text → boolean
 135
 136    Returns true if the similarity between the trigram set in the first
 137    argument and a continuous extent of an ordered trigram set in the
 138    second argument is greater than the current word similarity threshold
 139    set by pg_trgm.word_similarity_threshold parameter.
 140
 141    text %> text → boolean
 142
 143    Commutator of the <% operator.
 144
 145    text <<% text → boolean
 146
 147    Returns true if its second argument has a continuous extent of an
 148    ordered trigram set that matches word boundaries, and its similarity to
 149    the trigram set of the first argument is greater than the current
 150    strict word similarity threshold set by the
 151    pg_trgm.strict_word_similarity_threshold parameter.
 152
 153    text %>> text → boolean
 154
 155    Commutator of the <<% operator.
 156
 157    text <-> text → real
 158
 159    Returns the “distance” between the arguments, that is one minus the
 160    similarity() value.
 161
 162    text <<-> text → real
 163
 164    Returns the “distance” between the arguments, that is one minus the
 165    word_similarity() value.
 166
 167    text <->> text → real
 168
 169    Commutator of the <<-> operator.
 170
 171    text <<<-> text → real
 172
 173    Returns the “distance” between the arguments, that is one minus the
 174    strict_word_similarity() value.
 175
 176    text <->>> text → real
 177
 178    Commutator of the <<<-> operator.
 179
 180 F.35.3. GUC Parameters #
 181
 182    pg_trgm.similarity_threshold (real) #
 183           Sets the current similarity threshold that is used by the %
 184           operator. The threshold must be between 0 and 1 (default is
 185           0.3).
 186
 187    pg_trgm.word_similarity_threshold (real) #
 188           Sets the current word similarity threshold that is used by the
 189           <% and %> operators. The threshold must be between 0 and 1
 190           (default is 0.6).
 191
 192    pg_trgm.strict_word_similarity_threshold (real) #
 193           Sets the current strict word similarity threshold that is used
 194           by the <<% and %>> operators. The threshold must be between 0
 195           and 1 (default is 0.5).
 196
 197 F.35.4. Index Support #
 198
 199    The pg_trgm module provides GiST and GIN index operator classes that
 200    allow you to create an index over a text column for the purpose of very
 201    fast similarity searches. These index types support the above-described
 202    similarity operators, and additionally support trigram-based index
 203    searches for LIKE, ILIKE, ~, ~* and = queries. The similarity
 204    comparisons are case-insensitive in a default build of pg_trgm.
 205    Inequality operators are not supported. Note that those indexes may not
 206    be as efficient as regular B-tree indexes for equality operator.
 207
 208    Example:
 209 CREATE TABLE test_trgm (t text);
 210 CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops);
 211
 212    or
 213 CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);
 214
 215    gist_trgm_ops GiST opclass approximates a set of trigrams as a bitmap
 216    signature. Its optional integer parameter siglen determines the
 217    signature length in bytes. The default length is 12 bytes. Valid values
 218    of signature length are between 1 and 2024 bytes. Longer signatures
 219    lead to a more precise search (scanning a smaller fraction of the index
 220    and fewer heap pages), at the cost of a larger index.
 221
 222    Example of creating such an index with a signature length of 32 bytes:
 223 CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops(siglen=32));
 224
 225    At this point, you will have an index on the t column that you can use
 226    for similarity searching. A typical query is
 227 SELECT t, similarity(t, 'word') AS sml
 228   FROM test_trgm
 229   WHERE t % 'word'
 230   ORDER BY sml DESC, t;
 231
 232    This will return all values in the text column that are sufficiently
 233    similar to word, sorted from best match to worst. The index will be
 234    used to make this a fast operation even over very large data sets.
 235
 236    A variant of the above query is
 237 SELECT t, t <-> 'word' AS dist
 238   FROM test_trgm
 239   ORDER BY dist LIMIT 10;
 240
 241    This can be implemented quite efficiently by GiST indexes, but not by
 242    GIN indexes. It will usually beat the first formulation when only a
 243    small number of the closest matches is wanted.
 244
 245    Also you can use an index on the t column for word similarity or strict
 246    word similarity. Typical queries are:
 247 SELECT t, word_similarity('word', t) AS sml
 248   FROM test_trgm
 249   WHERE 'word' <% t
 250   ORDER BY sml DESC, t;
 251
 252    and
 253 SELECT t, strict_word_similarity('word', t) AS sml
 254   FROM test_trgm
 255   WHERE 'word' <<% t
 256   ORDER BY sml DESC, t;
 257
 258    This will return all values in the text column for which there is a
 259    continuous extent in the corresponding ordered trigram set that is
 260    sufficiently similar to the trigram set of word, sorted from best match
 261    to worst. The index will be used to make this a fast operation even
 262    over very large data sets.
 263
 264    Possible variants of the above queries are:
 265 SELECT t, 'word' <<-> t AS dist
 266   FROM test_trgm
 267   ORDER BY dist LIMIT 10;
 268
 269    and
 270 SELECT t, 'word' <<<-> t AS dist
 271   FROM test_trgm
 272   ORDER BY dist LIMIT 10;
 273
 274    This can be implemented quite efficiently by GiST indexes, but not by
 275    GIN indexes.
 276
 277    Beginning in PostgreSQL 9.1, these index types also support index
 278    searches for LIKE and ILIKE, for example
 279 SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';
 280
 281    The index search works by extracting trigrams from the search string
 282    and then looking these up in the index. The more trigrams in the search
 283    string, the more effective the index search is. Unlike B-tree based
 284    searches, the search string need not be left-anchored.
 285
 286    Beginning in PostgreSQL 9.3, these index types also support index
 287    searches for regular-expression matches (~ and ~* operators), for
 288    example
 289 SELECT * FROM test_trgm WHERE t ~ '(foo|bar)';
 290
 291    The index search works by extracting trigrams from the regular
 292    expression and then looking these up in the index. The more trigrams
 293    that can be extracted from the regular expression, the more effective
 294    the index search is. Unlike B-tree based searches, the search string
 295    need not be left-anchored.
 296
 297    For both LIKE and regular-expression searches, keep in mind that a
 298    pattern with no extractable trigrams will degenerate to a full-index
 299    scan.
 300
 301    The choice between GiST and GIN indexing depends on the relative
 302    performance characteristics of GiST and GIN, which are discussed
 303    elsewhere.
 304
 305 F.35.5. Text Search Integration #
 306
 307    Trigram matching is a very useful tool when used in conjunction with a
 308    full text index. In particular it can help to recognize misspelled
 309    input words that will not be matched directly by the full text search
 310    mechanism.
 311
 312    The first step is to generate an auxiliary table containing all the
 313    unique words in the documents:
 314 CREATE TABLE words AS SELECT word FROM
 315         ts_stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');
 316
 317    where documents is a table that has a text field bodytext that we wish
 318    to search. The reason for using the simple configuration with the
 319    to_tsvector function, instead of using a language-specific
 320    configuration, is that we want a list of the original (unstemmed)
 321    words.
 322
 323    Next, create a trigram index on the word column:
 324 CREATE INDEX words_idx ON words USING GIN (word gin_trgm_ops);
 325
 326    Now, a SELECT query similar to the previous example can be used to
 327    suggest spellings for misspelled words in user search terms. A useful
 328    extra test is to require that the selected words are also of similar
 329    length to the misspelled word.
 330
 331 Note
 332
 333    Since the words table has been generated as a separate, static table,
 334    it will need to be periodically regenerated so that it remains
 335    reasonably up-to-date with the document collection. Keeping it exactly
 336    current is usually unnecessary.
 337
 338 F.35.6. References #
 339
 340    GiST Development Site http://www.sai.msu.su/~megera/postgres/gist/
 341
 342    Tsearch2 Development Site
 343    http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
 344
 345 F.35.7. Authors #
 346
 347    Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
 348
 349    Teodor Sigaev <teodor@sigaev.ru>, Moscow, Delta-Soft Ltd.,Russia
 350
 351    Alexander Korotkov <a.korotkov@postgrespro.ru>, Moscow, Postgres
 352    Professional, Russia
 353
 354    Documentation: Christopher Kings-Lynne
 355
 356    This module is sponsored by Delta-Soft Ltd., Moscow, Russia.