2 F.35. pg_trgm — support for similarity of text using trigram matching #
4 F.35.1. Trigram (or Trigraph) Concepts
5 F.35.2. Functions and Operators
8 F.35.5. Text Search Integration
12 The pg_trgm module provides functions and operators for determining the
13 similarity of alphanumeric text based on trigram matching, as well as
14 index operator classes that support fast searching for similar strings.
16 This module is considered “trusted”, that is, it can be installed by
17 non-superusers who have CREATE privilege on the current database.
19 F.35.1. Trigram (or Trigraph) Concepts #
21 A trigram is a group of three consecutive characters taken from a
22 string. We can measure the similarity of two strings by counting the
23 number of trigrams they share. This simple idea turns out to be very
24 effective for measuring the similarity of words in many natural
29 pg_trgm ignores non-word characters (non-alphanumerics) when extracting
30 trigrams from a string. Each word is considered to have two spaces
31 prefixed and one space suffixed when determining the set of trigrams
32 contained in the string. For example, the set of trigrams in the string
33 “cat” is “ c”, “ ca”, “cat”, and “at ”. The set of trigrams in the
34 string “foo|bar” is “ f”, “ fo”, “foo”, “oo ”, “ b”, “ ba”, “bar”, and
37 F.35.2. Functions and Operators #
39 The functions provided by the pg_trgm module are shown in Table F.26,
40 the operators in Table F.27.
42 Table F.26. pg_trgm Functions
48 similarity ( text, text ) → real
50 Returns a number that indicates how similar the two arguments are. The
51 range of the result is zero (indicating that the two strings are
52 completely dissimilar) to one (indicating that the two strings are
55 show_trgm ( text ) → text[]
57 Returns an array of all the trigrams in the given string. (In practice
58 this is seldom useful except for debugging.)
60 word_similarity ( text, text ) → real
62 Returns a number that indicates the greatest similarity between the set
63 of trigrams in the first string and any continuous extent of an ordered
64 set of trigrams in the second string. For details, see the explanation
67 strict_word_similarity ( text, text ) → real
69 Same as word_similarity, but forces extent boundaries to match word
70 boundaries. Since we don't have cross-word trigrams, this function
71 actually returns greatest similarity between first string and any
72 continuous extent of words of the second string.
76 Returns the current similarity threshold used by the % operator. This
77 sets the minimum similarity between two words for them to be considered
78 similar enough to be misspellings of each other, for example.
79 (Deprecated; instead use SHOW pg_trgm.similarity_threshold.)
81 set_limit ( real ) → real
83 Sets the current similarity threshold that is used by the % operator.
84 The threshold must be between 0 and 1 (default is 0.3). Returns the
85 same value passed in. (Deprecated; instead use SET
86 pg_trgm.similarity_threshold.)
88 Consider the following example:
89 # SELECT word_similarity('word', 'two words');
95 In the first string, the set of trigrams is {" w"," wo","wor","ord","rd
96 "}. In the second string, the ordered set of trigrams is {" t","
97 tw","two","wo "," w"," wo","wor","ord","rds","ds "}. The most similar
98 extent of an ordered set of trigrams in the second string is {" w","
99 wo","wor","ord"}, and the similarity is 0.8.
101 This function returns a value that can be approximately understood as
102 the greatest similarity between the first string and any substring of
103 the second string. However, this function does not add padding to the
104 boundaries of the extent. Thus, the number of additional characters
105 present in the second string is not considered, except for the
106 mismatched word boundaries.
108 At the same time, strict_word_similarity selects an extent of words in
109 the second string. In the example above, strict_word_similarity would
110 select the extent of a single word 'words', whose set of trigrams is {"
111 w"," wo","wor","ord","rds","ds "}.
112 # SELECT strict_word_similarity('word', 'two words'), similarity('word', 'words'
114 strict_word_similarity | similarity
115 ------------------------+------------
119 Thus, the strict_word_similarity function is useful for finding the
120 similarity to whole words, while word_similarity is more suitable for
121 finding the similarity for parts of words.
123 Table F.27. pg_trgm Operators
129 text % text → boolean
131 Returns true if its arguments have a similarity that is greater than
132 the current similarity threshold set by pg_trgm.similarity_threshold.
134 text <% text → boolean
136 Returns true if the similarity between the trigram set in the first
137 argument and a continuous extent of an ordered trigram set in the
138 second argument is greater than the current word similarity threshold
139 set by pg_trgm.word_similarity_threshold parameter.
141 text %> text → boolean
143 Commutator of the <% operator.
145 text <<% text → boolean
147 Returns true if its second argument has a continuous extent of an
148 ordered trigram set that matches word boundaries, and its similarity to
149 the trigram set of the first argument is greater than the current
150 strict word similarity threshold set by the
151 pg_trgm.strict_word_similarity_threshold parameter.
153 text %>> text → boolean
155 Commutator of the <<% operator.
159 Returns the “distance” between the arguments, that is one minus the
162 text <<-> text → real
164 Returns the “distance” between the arguments, that is one minus the
165 word_similarity() value.
167 text <->> text → real
169 Commutator of the <<-> operator.
171 text <<<-> text → real
173 Returns the “distance” between the arguments, that is one minus the
174 strict_word_similarity() value.
176 text <->>> text → real
178 Commutator of the <<<-> operator.
180 F.35.3. GUC Parameters #
182 pg_trgm.similarity_threshold (real) #
183 Sets the current similarity threshold that is used by the %
184 operator. The threshold must be between 0 and 1 (default is
187 pg_trgm.word_similarity_threshold (real) #
188 Sets the current word similarity threshold that is used by the
189 <% and %> operators. The threshold must be between 0 and 1
192 pg_trgm.strict_word_similarity_threshold (real) #
193 Sets the current strict word similarity threshold that is used
194 by the <<% and %>> operators. The threshold must be between 0
195 and 1 (default is 0.5).
197 F.35.4. Index Support #
199 The pg_trgm module provides GiST and GIN index operator classes that
200 allow you to create an index over a text column for the purpose of very
201 fast similarity searches. These index types support the above-described
202 similarity operators, and additionally support trigram-based index
203 searches for LIKE, ILIKE, ~, ~* and = queries. The similarity
204 comparisons are case-insensitive in a default build of pg_trgm.
205 Inequality operators are not supported. Note that those indexes may not
206 be as efficient as regular B-tree indexes for equality operator.
209 CREATE TABLE test_trgm (t text);
210 CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops);
213 CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);
215 gist_trgm_ops GiST opclass approximates a set of trigrams as a bitmap
216 signature. Its optional integer parameter siglen determines the
217 signature length in bytes. The default length is 12 bytes. Valid values
218 of signature length are between 1 and 2024 bytes. Longer signatures
219 lead to a more precise search (scanning a smaller fraction of the index
220 and fewer heap pages), at the cost of a larger index.
222 Example of creating such an index with a signature length of 32 bytes:
223 CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops(siglen=32));
225 At this point, you will have an index on the t column that you can use
226 for similarity searching. A typical query is
227 SELECT t, similarity(t, 'word') AS sml
230 ORDER BY sml DESC, t;
232 This will return all values in the text column that are sufficiently
233 similar to word, sorted from best match to worst. The index will be
234 used to make this a fast operation even over very large data sets.
236 A variant of the above query is
237 SELECT t, t <-> 'word' AS dist
239 ORDER BY dist LIMIT 10;
241 This can be implemented quite efficiently by GiST indexes, but not by
242 GIN indexes. It will usually beat the first formulation when only a
243 small number of the closest matches is wanted.
245 Also you can use an index on the t column for word similarity or strict
246 word similarity. Typical queries are:
247 SELECT t, word_similarity('word', t) AS sml
250 ORDER BY sml DESC, t;
253 SELECT t, strict_word_similarity('word', t) AS sml
256 ORDER BY sml DESC, t;
258 This will return all values in the text column for which there is a
259 continuous extent in the corresponding ordered trigram set that is
260 sufficiently similar to the trigram set of word, sorted from best match
261 to worst. The index will be used to make this a fast operation even
262 over very large data sets.
264 Possible variants of the above queries are:
265 SELECT t, 'word' <<-> t AS dist
267 ORDER BY dist LIMIT 10;
270 SELECT t, 'word' <<<-> t AS dist
272 ORDER BY dist LIMIT 10;
274 This can be implemented quite efficiently by GiST indexes, but not by
277 Beginning in PostgreSQL 9.1, these index types also support index
278 searches for LIKE and ILIKE, for example
279 SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';
281 The index search works by extracting trigrams from the search string
282 and then looking these up in the index. The more trigrams in the search
283 string, the more effective the index search is. Unlike B-tree based
284 searches, the search string need not be left-anchored.
286 Beginning in PostgreSQL 9.3, these index types also support index
287 searches for regular-expression matches (~ and ~* operators), for
289 SELECT * FROM test_trgm WHERE t ~ '(foo|bar)';
291 The index search works by extracting trigrams from the regular
292 expression and then looking these up in the index. The more trigrams
293 that can be extracted from the regular expression, the more effective
294 the index search is. Unlike B-tree based searches, the search string
295 need not be left-anchored.
297 For both LIKE and regular-expression searches, keep in mind that a
298 pattern with no extractable trigrams will degenerate to a full-index
301 The choice between GiST and GIN indexing depends on the relative
302 performance characteristics of GiST and GIN, which are discussed
305 F.35.5. Text Search Integration #
307 Trigram matching is a very useful tool when used in conjunction with a
308 full text index. In particular it can help to recognize misspelled
309 input words that will not be matched directly by the full text search
312 The first step is to generate an auxiliary table containing all the
313 unique words in the documents:
314 CREATE TABLE words AS SELECT word FROM
315 ts_stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');
317 where documents is a table that has a text field bodytext that we wish
318 to search. The reason for using the simple configuration with the
319 to_tsvector function, instead of using a language-specific
320 configuration, is that we want a list of the original (unstemmed)
323 Next, create a trigram index on the word column:
324 CREATE INDEX words_idx ON words USING GIN (word gin_trgm_ops);
326 Now, a SELECT query similar to the previous example can be used to
327 suggest spellings for misspelled words in user search terms. A useful
328 extra test is to require that the selected words are also of similar
329 length to the misspelled word.
333 Since the words table has been generated as a separate, static table,
334 it will need to be periodically regenerated so that it remains
335 reasonably up-to-date with the document collection. Keeping it exactly
336 current is usually unnecessary.
340 GiST Development Site http://www.sai.msu.su/~megera/postgres/gist/
342 Tsearch2 Development Site
343 http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
347 Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
349 Teodor Sigaev <teodor@sigaev.ru>, Moscow, Delta-Soft Ltd.,Russia
351 Alexander Korotkov <a.korotkov@postgrespro.ru>, Moscow, Postgres
354 Documentation: Christopher Kings-Lynne
356 This module is sponsored by Delta-Soft Ltd., Moscow, Russia.