2 9.7. Pattern Matching #
5 9.7.2. SIMILAR TO Regular Expressions
6 9.7.3. POSIX Regular Expressions
8 There are three separate approaches to pattern matching provided by
9 PostgreSQL: the traditional SQL LIKE operator, the more recent SIMILAR
10 TO operator (added in SQL:1999), and POSIX-style regular expressions.
11 Aside from the basic “does this string match this pattern?” operators,
12 functions are available to extract or replace matching substrings and
13 to split a string at matching locations.
17 If you have pattern matching needs that go beyond this, consider
18 writing a user-defined function in Perl or Tcl.
22 While most regular-expression searches can be executed very quickly,
23 regular expressions can be contrived that take arbitrary amounts of
24 time and memory to process. Be wary of accepting regular-expression
25 search patterns from hostile sources. If you must do so, it is
26 advisable to impose a statement timeout.
28 Searches using SIMILAR TO patterns have the same security hazards,
29 since SIMILAR TO provides many of the same capabilities as POSIX-style
32 LIKE searches, being much simpler than the other two options, are safer
33 to use with possibly-hostile pattern sources.
35 SIMILAR TO and POSIX-style regular expressions do not support
36 nondeterministic collations. If required, use LIKE or apply a different
37 collation to the expression to work around this limitation.
41 string LIKE pattern [ESCAPE escape-character]
42 string NOT LIKE pattern [ESCAPE escape-character]
44 The LIKE expression returns true if the string matches the supplied
45 pattern. (As expected, the NOT LIKE expression returns false if LIKE
46 returns true, and vice versa. An equivalent expression is NOT (string
49 If pattern does not contain percent signs or underscores, then the
50 pattern only represents the string itself; in that case LIKE acts like
51 the equals operator. An underscore (_) in pattern stands for (matches)
52 any single character; a percent sign (%) matches any sequence of zero
61 LIKE pattern matching supports nondeterministic collations (see
62 Section 23.2.2.4), such as case-insensitive collations or collations
63 that, say, ignore punctuation. So with a case-insensitive collation,
65 'AbC' LIKE 'abc' COLLATE case_insensitive true
66 'AbC' LIKE 'a%' COLLATE case_insensitive true
68 With collations that ignore certain characters or in general that
69 consider strings of different lengths equal, the semantics can become a
70 bit more complicated. Consider these examples:
71 '.foo.' LIKE 'foo' COLLATE ign_punct true
72 '.foo.' LIKE 'f_o' COLLATE ign_punct true
73 '.foo.' LIKE '_oo' COLLATE ign_punct false
75 The way the matching works is that the pattern is partitioned into
76 sequences of wildcards and non-wildcard strings (wildcards being _ and
77 %). For example, the pattern f_o is partitioned into f, _, o, the
78 pattern _oo is partitioned into _, oo. The input string matches the
79 pattern if it can be partitioned in such a way that the wildcards match
80 one character or any number of characters respectively and the
81 non-wildcard partitions are equal under the applicable collation. So
82 for example, '.foo.' LIKE 'f_o' COLLATE ign_punct is true because one
83 can partition .foo. into .f, o, o., and then '.f' = 'f' COLLATE
84 ign_punct, 'o' matches the _ wildcard, and 'o.' = 'o' COLLATE
85 ign_punct. But '.foo.' LIKE '_oo' COLLATE ign_punct is false because
86 .foo. cannot be partitioned in a way that the first character is any
87 character and the rest of the string compares equal to oo. (Note that
88 the single-character wildcard always matches exactly one character,
89 independent of the collation. So in this example, the _ would match .,
90 but then the rest of the input string won't match the rest of the
93 LIKE pattern matching always covers the entire string. Therefore, if
94 it's desired to match a sequence anywhere within a string, the pattern
95 must start and end with a percent sign.
97 To match a literal underscore or percent sign without matching other
98 characters, the respective character in pattern must be preceded by the
99 escape character. The default escape character is the backslash but a
100 different one can be selected by using the ESCAPE clause. To match the
101 escape character itself, write two escape characters.
105 If you have standard_conforming_strings turned off, any backslashes you
106 write in literal string constants will need to be doubled. See
107 Section 4.1.2.1 for more information.
109 It's also possible to select no escape character by writing ESCAPE ''.
110 This effectively disables the escape mechanism, which makes it
111 impossible to turn off the special meaning of underscore and percent
112 signs in the pattern.
114 According to the SQL standard, omitting ESCAPE means there is no escape
115 character (rather than defaulting to a backslash), and a zero-length
116 ESCAPE value is disallowed. PostgreSQL's behavior in this regard is
117 therefore slightly nonstandard.
119 The key word ILIKE can be used instead of LIKE to make the match
120 case-insensitive according to the active locale. (But this does not
121 support nondeterministic collations.) This is not in the SQL standard
122 but is a PostgreSQL extension.
124 The operator ~~ is equivalent to LIKE, and ~~* corresponds to ILIKE.
125 There are also !~~ and !~~* operators that represent NOT LIKE and NOT
126 ILIKE, respectively. All of these operators are PostgreSQL-specific.
127 You may see these operator names in EXPLAIN output and similar places,
128 since the parser actually translates LIKE et al. to these operators.
130 The phrases LIKE, ILIKE, NOT LIKE, and NOT ILIKE are generally treated
131 as operators in PostgreSQL syntax; for example they can be used in
132 expression operator ANY (subquery) constructs, although an ESCAPE
133 clause cannot be included there. In some obscure cases it may be
134 necessary to use the underlying operator names instead.
136 Also see the starts-with operator ^@ and the corresponding
137 starts_with() function, which are useful in cases where simply matching
138 the beginning of a string is needed.
140 9.7.2. SIMILAR TO Regular Expressions #
142 string SIMILAR TO pattern [ESCAPE escape-character]
143 string NOT SIMILAR TO pattern [ESCAPE escape-character]
145 The SIMILAR TO operator returns true or false depending on whether its
146 pattern matches the given string. It is similar to LIKE, except that it
147 interprets the pattern using the SQL standard's definition of a regular
148 expression. SQL regular expressions are a curious cross between LIKE
149 notation and common (POSIX) regular expression notation.
151 Like LIKE, the SIMILAR TO operator succeeds only if its pattern matches
152 the entire string; this is unlike common regular expression behavior
153 where the pattern can match any part of the string. Also like LIKE,
154 SIMILAR TO uses _ and % as wildcard characters denoting any single
155 character and any string, respectively (these are comparable to . and
156 .* in POSIX regular expressions).
158 In addition to these facilities borrowed from LIKE, SIMILAR TO supports
159 these pattern-matching metacharacters borrowed from POSIX regular
161 * | denotes alternation (either of two alternatives).
162 * * denotes repetition of the previous item zero or more times.
163 * + denotes repetition of the previous item one or more times.
164 * ? denotes repetition of the previous item zero or one time.
165 * {m} denotes repetition of the previous item exactly m times.
166 * {m,} denotes repetition of the previous item m or more times.
167 * {m,n} denotes repetition of the previous item at least m and not
169 * Parentheses () can be used to group items into a single logical
171 * A bracket expression [...] specifies a character class, just as in
172 POSIX regular expressions.
174 Notice that the period (.) is not a metacharacter for SIMILAR TO.
176 As with LIKE, a backslash disables the special meaning of any of these
177 metacharacters. A different escape character can be specified with
178 ESCAPE, or the escape capability can be disabled by writing ESCAPE ''.
180 According to the SQL standard, omitting ESCAPE means there is no escape
181 character (rather than defaulting to a backslash), and a zero-length
182 ESCAPE value is disallowed. PostgreSQL's behavior in this regard is
183 therefore slightly nonstandard.
185 Another nonstandard extension is that following the escape character
186 with a letter or digit provides access to the escape sequences defined
187 for POSIX regular expressions; see Table 9.20, Table 9.21, and
191 'abc' SIMILAR TO 'abc' true
192 'abc' SIMILAR TO 'a' false
193 'abc' SIMILAR TO '%(b|d)%' true
194 'abc' SIMILAR TO '(b|c)%' false
195 '-abc-' SIMILAR TO '%\mabc\M%' true
196 'xabcy' SIMILAR TO '%\mabc\M%' false
198 The substring function with three parameters provides extraction of a
199 substring that matches an SQL regular expression pattern. The function
200 can be written according to standard SQL syntax:
201 substring(string similar pattern escape escape-character)
203 or using the now obsolete SQL:1999 syntax:
204 substring(string from pattern for escape-character)
206 or as a plain three-argument function:
207 substring(string, pattern, escape-character)
209 As with SIMILAR TO, the specified pattern must match the entire data
210 string, or else the function fails and returns null. To indicate the
211 part of the pattern for which the matching data sub-string is of
212 interest, the pattern should contain two occurrences of the escape
213 character followed by a double quote ("). The text matching the portion
214 of the pattern between these separators is returned when the match is
217 The escape-double-quote separators actually divide substring's pattern
218 into three independent regular expressions; for example, a vertical bar
219 (|) in any of the three sections affects only that section. Also, the
220 first and third of these regular expressions are defined to match the
221 smallest possible amount of text, not the largest, when there is any
222 ambiguity about how much of the data string matches which pattern. (In
223 POSIX parlance, the first and third regular expressions are forced to
226 As an extension to the SQL standard, PostgreSQL allows there to be just
227 one escape-double-quote separator, in which case the third regular
228 expression is taken as empty; or no separators, in which case the first
229 and third regular expressions are taken as empty.
231 Some examples, with #" delimiting the return string:
232 substring('foobar' similar '%#"o_b#"%' escape '#') oob
233 substring('foobar' similar '#"o_b#"%' escape '#') NULL
235 9.7.3. POSIX Regular Expressions #
237 Table 9.16 lists the available operators for pattern matching using
238 POSIX regular expressions.
240 Table 9.16. Regular Expression Match Operators
248 text ~ text → boolean
250 String matches regular expression, case sensitively
252 'thomas' ~ 't.*ma' → t
254 text ~* text → boolean
256 String matches regular expression, case-insensitively
258 'thomas' ~* 'T.*ma' → t
260 text !~ text → boolean
262 String does not match regular expression, case sensitively
264 'thomas' !~ 't.*max' → t
266 text !~* text → boolean
268 String does not match regular expression, case-insensitively
270 'thomas' !~* 'T.*ma' → f
272 POSIX regular expressions provide a more powerful means for pattern
273 matching than the LIKE and SIMILAR TO operators. Many Unix tools such
274 as egrep, sed, or awk use a pattern matching language that is similar
275 to the one described here.
277 A regular expression is a character sequence that is an abbreviated
278 definition of a set of strings (a regular set). A string is said to
279 match a regular expression if it is a member of the regular set
280 described by the regular expression. As with LIKE, pattern characters
281 match string characters exactly unless they are special characters in
282 the regular expression language — but regular expressions use different
283 special characters than LIKE does. Unlike LIKE patterns, a regular
284 expression is allowed to match anywhere within a string, unless the
285 regular expression is explicitly anchored to the beginning or end of
290 'abcd' ~ 'a.c' true — dot matches any character
291 'abcd' ~ 'a.*d' true — * repeats the preceding pattern item
292 'abcd' ~ '(b|x)' true — | means OR, parentheses group
293 'abcd' ~ '^a' true — ^ anchors to start of string
294 'abcd' ~ '^(b|c)' false — would match except for anchoring
296 The POSIX pattern language is described in much greater detail below.
298 The substring function with two parameters, substring(string from
299 pattern), provides extraction of a substring that matches a POSIX
300 regular expression pattern. It returns null if there is no match,
301 otherwise the first portion of the text that matched the pattern. But
302 if the pattern contains any parentheses, the portion of the text that
303 matched the first parenthesized subexpression (the one whose left
304 parenthesis comes first) is returned. You can put parentheses around
305 the whole expression if you want to use parentheses within it without
306 triggering this exception. If you need parentheses in the pattern
307 before the subexpression you want to extract, see the non-capturing
308 parentheses described below.
311 substring('foobar' from 'o.b') oob
312 substring('foobar' from 'o(.)b') o
314 The regexp_count function counts the number of places where a POSIX
315 regular expression pattern matches a string. It has the syntax
316 regexp_count(string, pattern [, start [, flags ]]). pattern is searched
317 for in string, normally from the beginning of the string, but if the
318 start parameter is provided then beginning from that character index.
319 The flags parameter is an optional text string containing zero or more
320 single-letter flags that change the function's behavior. For example,
321 including i in flags specifies case-insensitive matching. Supported
322 flags are described in Table 9.24.
325 regexp_count('ABCABCAXYaxy', 'A.') 3
326 regexp_count('ABCABCAXYaxy', 'A.', 1, 'i') 4
328 The regexp_instr function returns the starting or ending position of
329 the N'th match of a POSIX regular expression pattern to a string, or
330 zero if there is no such match. It has the syntax regexp_instr(string,
331 pattern [, start [, N [, endoption [, flags [, subexpr ]]]]]). pattern
332 is searched for in string, normally from the beginning of the string,
333 but if the start parameter is provided then beginning from that
334 character index. If N is specified then the N'th match of the pattern
335 is located, otherwise the first match is located. If the endoption
336 parameter is omitted or specified as zero, the function returns the
337 position of the first character of the match. Otherwise, endoption must
338 be one, and the function returns the position of the character
339 following the match. The flags parameter is an optional text string
340 containing zero or more single-letter flags that change the function's
341 behavior. Supported flags are described in Table 9.24. For a pattern
342 containing parenthesized subexpressions, subexpr is an integer
343 indicating which subexpression is of interest: the result identifies
344 the position of the substring matching that subexpression.
345 Subexpressions are numbered in the order of their leading parentheses.
346 When subexpr is omitted or zero, the result identifies the position of
347 the whole match regardless of parenthesized subexpressions.
350 regexp_instr('number of your street, town zip, FR', '[^,]+', 1, 2)
352 regexp_instr(string=>'ABCDEFGHI', pattern=>'(c..)(...)', start=>1, "N"=>1, endop
353 tion=>0, flags=>'i', subexpr=>2)
356 The regexp_like function checks whether a match of a POSIX regular
357 expression pattern occurs within a string, returning boolean true or
358 false. It has the syntax regexp_like(string, pattern [, flags ]). The
359 flags parameter is an optional text string containing zero or more
360 single-letter flags that change the function's behavior. Supported
361 flags are described in Table 9.24. This function has the same results
362 as the ~ operator if no flags are specified. If only the i flag is
363 specified, it has the same results as the ~* operator.
366 regexp_like('Hello World', 'world') false
367 regexp_like('Hello World', 'world', 'i') true
369 The regexp_match function returns a text array of matching substring(s)
370 within the first match of a POSIX regular expression pattern to a
371 string. It has the syntax regexp_match(string, pattern [, flags ]). If
372 there is no match, the result is NULL. If a match is found, and the
373 pattern contains no parenthesized subexpressions, then the result is a
374 single-element text array containing the substring matching the whole
375 pattern. If a match is found, and the pattern contains parenthesized
376 subexpressions, then the result is a text array whose n'th element is
377 the substring matching the n'th parenthesized subexpression of the
378 pattern (not counting “non-capturing” parentheses; see below for
379 details). The flags parameter is an optional text string containing
380 zero or more single-letter flags that change the function's behavior.
381 Supported flags are described in Table 9.24.
384 SELECT regexp_match('foobarbequebaz', 'bar.*que');
390 SELECT regexp_match('foobarbequebaz', '(bar)(beque)');
398 In the common case where you just want the whole matching substring or
399 NULL for no match, the best solution is to use regexp_substr().
400 However, regexp_substr() only exists in PostgreSQL version 15 and up.
401 When working in older versions, you can extract the first element of
402 regexp_match()'s result, for example:
403 SELECT (regexp_match('foobarbequebaz', 'bar.*que'))[1];
409 The regexp_matches function returns a set of text arrays of matching
410 substring(s) within matches of a POSIX regular expression pattern to a
411 string. It has the same syntax as regexp_match. This function returns
412 no rows if there is no match, one row if there is a match and the g
413 flag is not given, or N rows if there are N matches and the g flag is
414 given. Each returned row is a text array containing the whole matched
415 substring or the substrings matching parenthesized subexpressions of
416 the pattern, just as described above for regexp_match. regexp_matches
417 accepts all the flags shown in Table 9.24, plus the g flag which
418 commands it to return all matches, not just the first one.
421 SELECT regexp_matches('foo', 'not there');
426 SELECT regexp_matches('foobarbequebazilbarfbonk', '(b[^b]+)(b[^b]+)', 'g');
435 In most cases regexp_matches() should be used with the g flag, since if
436 you only want the first match, it's easier and more efficient to use
437 regexp_match(). However, regexp_match() only exists in PostgreSQL
438 version 10 and up. When working in older versions, a common trick is to
439 place a regexp_matches() call in a sub-select, for example:
440 SELECT col1, (SELECT regexp_matches(col2, '(bar)(beque)')) FROM tab;
442 This produces a text array if there's a match, or NULL if not, the same
443 as regexp_match() would do. Without the sub-select, this query would
444 produce no output at all for table rows without a match, which is
445 typically not the desired behavior.
447 The regexp_replace function provides substitution of new text for
448 substrings that match POSIX regular expression patterns. It has the
449 syntax regexp_replace(string, pattern, replacement [, flags ]) or
450 regexp_replace(string, pattern, replacement, start [, N [, flags ]]).
451 The source string is returned unchanged if there is no match to the
452 pattern. If there is a match, the string is returned with the
453 replacement string substituted for the matching substring. The
454 replacement string can contain \n, where n is 1 through 9, to indicate
455 that the source substring matching the n'th parenthesized subexpression
456 of the pattern should be inserted, and it can contain \& to indicate
457 that the substring matching the entire pattern should be inserted.
458 Write \\ if you need to put a literal backslash in the replacement
459 text. pattern is searched for in string, normally from the beginning of
460 the string, but if the start parameter is provided then beginning from
461 that character index. By default, only the first match of the pattern
462 is replaced. If N is specified and is greater than zero, then the N'th
463 match of the pattern is replaced. If the g flag is given, or if N is
464 specified and is zero, then all matches at or after the start position
465 are replaced. (The g flag is ignored when N is specified.) The flags
466 parameter is an optional text string containing zero or more
467 single-letter flags that change the function's behavior. Supported
468 flags (though not g) are described in Table 9.24.
471 regexp_replace('foobarbaz', 'b..', 'X')
473 regexp_replace('foobarbaz', 'b..', 'X', 'g')
475 regexp_replace('foobarbaz', 'b(..)', 'X\1Y', 'g')
477 regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 0, 'i')
478 X PXstgrXSQL fXnctXXn
479 regexp_replace(string=>'A PostgreSQL function', pattern=>'a|e|i|o|u', replacemen
480 t=>'X', start=>1, "N"=>3, flags=>'i')
481 A PostgrXSQL function
483 The regexp_split_to_table function splits a string using a POSIX
484 regular expression pattern as a delimiter. It has the syntax
485 regexp_split_to_table(string, pattern [, flags ]). If there is no match
486 to the pattern, the function returns the string. If there is at least
487 one match, for each match it returns the text from the end of the last
488 match (or the beginning of the string) to the beginning of the match.
489 When there are no more matches, it returns the text from the end of the
490 last match to the end of the string. The flags parameter is an optional
491 text string containing zero or more single-letter flags that change the
492 function's behavior. regexp_split_to_table supports the flags described
495 The regexp_split_to_array function behaves the same as
496 regexp_split_to_table, except that regexp_split_to_array returns its
497 result as an array of text. It has the syntax
498 regexp_split_to_array(string, pattern [, flags ]). The parameters are
499 the same as for regexp_split_to_table.
502 SELECT foo FROM regexp_split_to_table('the quick brown fox jumps over the lazy d
517 SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', '\s+
519 regexp_split_to_array
520 -----------------------------------------------
521 {the,quick,brown,fox,jumps,over,the,lazy,dog}
524 SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
545 As the last example demonstrates, the regexp split functions ignore
546 zero-length matches that occur at the start or end of the string or
547 immediately after a previous match. This is contrary to the strict
548 definition of regexp matching that is implemented by the other regexp
549 functions, but is usually the most convenient behavior in practice.
550 Other software systems such as Perl use similar definitions.
552 The regexp_substr function returns the substring that matches a POSIX
553 regular expression pattern, or NULL if there is no match. It has the
554 syntax regexp_substr(string, pattern [, start [, N [, flags [, subexpr
555 ]]]]). pattern is searched for in string, normally from the beginning
556 of the string, but if the start parameter is provided then beginning
557 from that character index. If N is specified then the N'th match of the
558 pattern is returned, otherwise the first match is returned. The flags
559 parameter is an optional text string containing zero or more
560 single-letter flags that change the function's behavior. Supported
561 flags are described in Table 9.24. For a pattern containing
562 parenthesized subexpressions, subexpr is an integer indicating which
563 subexpression is of interest: the result is the substring matching that
564 subexpression. Subexpressions are numbered in the order of their
565 leading parentheses. When subexpr is omitted or zero, the result is the
566 whole match regardless of parenthesized subexpressions.
569 regexp_substr('number of your street, town zip, FR', '[^,]+', 1, 2)
571 regexp_substr('ABCDEFGHI', '(c..)(...)', 1, 1, 'i', 2)
574 9.7.3.1. Regular Expression Details #
576 PostgreSQL's regular expressions are implemented using a software
577 package written by Henry Spencer. Much of the description of regular
578 expressions below is copied verbatim from his manual.
580 Regular expressions (REs), as defined in POSIX 1003.2, come in two
581 forms: extended REs or EREs (roughly those of egrep), and basic REs or
582 BREs (roughly those of ed). PostgreSQL supports both forms, and also
583 implements some extensions that are not in the POSIX standard, but have
584 become widely used due to their availability in programming languages
585 such as Perl and Tcl. REs using these non-POSIX extensions are called
586 advanced REs or AREs in this documentation. AREs are almost an exact
587 superset of EREs, but BREs have several notational incompatibilities
588 (as well as being much more limited). We first describe the ARE and ERE
589 forms, noting features that apply only to AREs, and then describe how
594 PostgreSQL always initially presumes that a regular expression follows
595 the ARE rules. However, the more limited ERE or BRE rules can be chosen
596 by prepending an embedded option to the RE pattern, as described in
597 Section 9.7.3.4. This can be useful for compatibility with applications
598 that expect exactly the POSIX 1003.2 rules.
600 A regular expression is defined as one or more branches, separated by
601 |. It matches anything that matches one of the branches.
603 A branch is zero or more quantified atoms or constraints, concatenated.
604 It matches a match for the first, followed by a match for the second,
605 etc.; an empty branch matches the empty string.
607 A quantified atom is an atom possibly followed by a single quantifier.
608 Without a quantifier, it matches a match for the atom. With a
609 quantifier, it can match some number of matches of the atom. An atom
610 can be any of the possibilities shown in Table 9.17. The possible
611 quantifiers and their meanings are shown in Table 9.18.
613 A constraint matches an empty string, but matches only when specific
614 conditions are met. A constraint can be used where an atom could be
615 used, except it cannot be followed by a quantifier. The simple
616 constraints are shown in Table 9.19; some more constraints are
619 Table 9.17. Regular Expression Atoms
621 (re) (where re is any regular expression) matches a match for re, with
622 the match noted for possible reporting
623 (?:re) as above, but the match is not noted for reporting (a
624 “non-capturing” set of parentheses) (AREs only)
625 . matches any single character
626 [chars] a bracket expression, matching any one of the chars (see
627 Section 9.7.3.2 for more detail)
628 \k (where k is a non-alphanumeric character) matches that character
629 taken as an ordinary character, e.g., \\ matches a backslash character
630 \c where c is alphanumeric (possibly followed by other characters) is
631 an escape, see Section 9.7.3.3 (AREs only; in EREs and BREs, this
633 { when followed by a character other than a digit, matches the
634 left-brace character {; when followed by a digit, it is the beginning
635 of a bound (see below)
636 x where x is a single character with no other significance, matches
639 An RE cannot end with a backslash (\).
643 If you have standard_conforming_strings turned off, any backslashes you
644 write in literal string constants will need to be doubled. See
645 Section 4.1.2.1 for more information.
647 Table 9.18. Regular Expression Quantifiers
649 * a sequence of 0 or more matches of the atom
650 + a sequence of 1 or more matches of the atom
651 ? a sequence of 0 or 1 matches of the atom
652 {m} a sequence of exactly m matches of the atom
653 {m,} a sequence of m or more matches of the atom
654 {m,n} a sequence of m through n (inclusive) matches of the atom; m
656 *? non-greedy version of *
657 +? non-greedy version of +
658 ?? non-greedy version of ?
659 {m}? non-greedy version of {m}
660 {m,}? non-greedy version of {m,}
661 {m,n}? non-greedy version of {m,n}
663 The forms using {...} are known as bounds. The numbers m and n within a
664 bound are unsigned decimal integers with permissible values from 0 to
667 Non-greedy quantifiers (available in AREs only) match the same
668 possibilities as their corresponding normal (greedy) counterparts, but
669 prefer the smallest number rather than the largest number of matches.
670 See Section 9.7.3.5 for more detail.
674 A quantifier cannot immediately follow another quantifier, e.g., ** is
675 invalid. A quantifier cannot begin an expression or subexpression or
678 Table 9.19. Regular Expression Constraints
679 Constraint Description
680 ^ matches at the beginning of the string
681 $ matches at the end of the string
682 (?=re) positive lookahead matches at any point where a substring
683 matching re begins (AREs only)
684 (?!re) negative lookahead matches at any point where no substring
685 matching re begins (AREs only)
686 (?<=re) positive lookbehind matches at any point where a substring
687 matching re ends (AREs only)
688 (?<!re) negative lookbehind matches at any point where no substring
689 matching re ends (AREs only)
691 Lookahead and lookbehind constraints cannot contain back references
692 (see Section 9.7.3.3), and all parentheses within them are considered
695 9.7.3.2. Bracket Expressions #
697 A bracket expression is a list of characters enclosed in []. It
698 normally matches any single character from the list (but see below). If
699 the list begins with ^, it matches any single character not from the
700 rest of the list. If two characters in the list are separated by -,
701 this is shorthand for the full range of characters between those two
702 (inclusive) in the collating sequence, e.g., [0-9] in ASCII matches any
703 decimal digit. It is illegal for two ranges to share an endpoint, e.g.,
704 a-c-e. Ranges are very collating-sequence-dependent, so portable
705 programs should avoid relying on them.
707 To include a literal ] in the list, make it the first character (after
708 ^, if that is used). To include a literal -, make it the first or last
709 character, or the second endpoint of a range. To use a literal - as the
710 first endpoint of a range, enclose it in [. and .] to make it a
711 collating element (see below). With the exception of these characters,
712 some combinations using [ (see next paragraphs), and escapes (AREs
713 only), all other special characters lose their special significance
714 within a bracket expression. In particular, \ is not special when
715 following ERE or BRE rules, though it is special (as introducing an
718 Within a bracket expression, a collating element (a character, a
719 multiple-character sequence that collates as if it were a single
720 character, or a collating-sequence name for either) enclosed in [. and
721 .] stands for the sequence of characters of that collating element. The
722 sequence is treated as a single element of the bracket expression's
723 list. This allows a bracket expression containing a multiple-character
724 collating element to match more than one character, e.g., if the
725 collating sequence includes a ch collating element, then the RE
726 [[.ch.]]*c matches the first five characters of chchcc.
730 PostgreSQL currently does not support multi-character collating
731 elements. This information describes possible future behavior.
733 Within a bracket expression, a collating element enclosed in [= and =]
734 is an equivalence class, standing for the sequences of characters of
735 all collating elements equivalent to that one, including itself. (If
736 there are no other equivalent collating elements, the treatment is as
737 if the enclosing delimiters were [. and .].) For example, if o and ^
738 are the members of an equivalence class, then [[=o=]], [[=^=]], and
739 [o^] are all synonymous. An equivalence class cannot be an endpoint of
742 Within a bracket expression, the name of a character class enclosed in
743 [: and :] stands for the list of all characters belonging to that
744 class. A character class cannot be used as an endpoint of a range. The
745 POSIX standard defines these character class names: alnum (letters and
746 numeric digits), alpha (letters), blank (space and tab), cntrl (control
747 characters), digit (numeric digits), graph (printable characters except
748 space), lower (lower-case letters), print (printable characters
749 including space), punct (punctuation), space (any white space), upper
750 (upper-case letters), and xdigit (hexadecimal digits). The behavior of
751 these standard character classes is generally consistent across
752 platforms for characters in the 7-bit ASCII set. Whether a given
753 non-ASCII character is considered to belong to one of these classes
754 depends on the collation that is used for the regular-expression
755 function or operator (see Section 23.2), or by default on the
756 database's LC_CTYPE locale setting (see Section 23.1). The
757 classification of non-ASCII characters can vary across platforms even
758 in similarly-named locales. (But the C locale never considers any
759 non-ASCII characters to belong to any of these classes.) In addition to
760 these standard character classes, PostgreSQL defines the word character
761 class, which is the same as alnum plus the underscore (_) character,
762 and the ascii character class, which contains exactly the 7-bit ASCII
765 There are two special cases of bracket expressions: the bracket
766 expressions [[:<:]] and [[:>:]] are constraints, matching empty strings
767 at the beginning and end of a word respectively. A word is defined as a
768 sequence of word characters that is neither preceded nor followed by
769 word characters. A word character is any character belonging to the
770 word character class, that is, any letter, digit, or underscore. This
771 is an extension, compatible with but not specified by POSIX 1003.2, and
772 should be used with caution in software intended to be portable to
773 other systems. The constraint escapes described below are usually
774 preferable; they are no more standard, but are easier to type.
776 9.7.3.3. Regular Expression Escapes #
778 Escapes are special sequences beginning with \ followed by an
779 alphanumeric character. Escapes come in several varieties: character
780 entry, class shorthands, constraint escapes, and back references. A \
781 followed by an alphanumeric character but not constituting a valid
782 escape is illegal in AREs. In EREs, there are no escapes: outside a
783 bracket expression, a \ followed by an alphanumeric character merely
784 stands for that character as an ordinary character, and inside a
785 bracket expression, \ is an ordinary character. (The latter is the one
786 actual incompatibility between EREs and AREs.)
788 Character-entry escapes exist to make it easier to specify non-printing
789 and other inconvenient characters in REs. They are shown in Table 9.20.
791 Class-shorthand escapes provide shorthands for certain commonly-used
792 character classes. They are shown in Table 9.21.
794 A constraint escape is a constraint, matching the empty string if
795 specific conditions are met, written as an escape. They are shown in
798 A back reference (\n) matches the same string matched by the previous
799 parenthesized subexpression specified by the number n (see Table 9.23).
800 For example, ([bc])\1 matches bb or cc but not bc or cb. The
801 subexpression must entirely precede the back reference in the RE.
802 Subexpressions are numbered in the order of their leading parentheses.
803 Non-capturing parentheses do not define subexpressions. The back
804 reference considers only the string characters matched by the
805 referenced subexpression, not any constraints contained in it. For
806 example, (^\d)\1 will match 22.
808 Table 9.20. Regular Expression Character-Entry Escapes
810 \a alert (bell) character, as in C
811 \b backspace, as in C
812 \B synonym for backslash (\) to help reduce the need for backslash
814 \cX (where X is any character) the character whose low-order 5 bits are
815 the same as those of X, and whose other bits are all zero
816 \e the character whose collating-sequence name is ESC, or failing that,
817 the character with octal value 033
818 \f form feed, as in C
820 \r carriage return, as in C
821 \t horizontal tab, as in C
822 \uwxyz (where wxyz is exactly four hexadecimal digits) the character
823 whose hexadecimal value is 0xwxyz
824 \Ustuvwxyz (where stuvwxyz is exactly eight hexadecimal digits) the
825 character whose hexadecimal value is 0xstuvwxyz
826 \v vertical tab, as in C
827 \xhhh (where hhh is any sequence of hexadecimal digits) the character
828 whose hexadecimal value is 0xhhh (a single character no matter how many
829 hexadecimal digits are used)
830 \0 the character whose value is 0 (the null byte)
831 \xy (where xy is exactly two octal digits, and is not a back reference)
832 the character whose octal value is 0xy
833 \xyz (where xyz is exactly three octal digits, and is not a back
834 reference) the character whose octal value is 0xyz
836 Hexadecimal digits are 0-9, a-f, and A-F. Octal digits are 0-7.
838 Numeric character-entry escapes specifying values outside the ASCII
839 range (0–127) have meanings dependent on the database encoding. When
840 the encoding is UTF-8, escape values are equivalent to Unicode code
841 points, for example \u1234 means the character U+1234. For other
842 multibyte encodings, character-entry escapes usually just specify the
843 concatenation of the byte values for the character. If the escape value
844 does not correspond to any legal character in the database encoding, no
845 error will be raised, but it will never match any data.
847 The character-entry escapes are always taken as ordinary characters.
848 For example, \135 is ] in ASCII, but \135 does not terminate a bracket
851 Table 9.21. Regular Expression Class-Shorthand Escapes
853 \d matches any digit, like [[:digit:]]
854 \s matches any whitespace character, like [[:space:]]
855 \w matches any word character, like [[:word:]]
856 \D matches any non-digit, like [^[:digit:]]
857 \S matches any non-whitespace character, like [^[:space:]]
858 \W matches any non-word character, like [^[:word:]]
860 The class-shorthand escapes also work within bracket expressions,
861 although the definitions shown above are not quite syntactically valid
862 in that context. For example, [a-c\d] is equivalent to [a-c[:digit:]].
864 Table 9.22. Regular Expression Constraint Escapes
866 \A matches only at the beginning of the string (see Section 9.7.3.5 for
867 how this differs from ^)
868 \m matches only at the beginning of a word
869 \M matches only at the end of a word
870 \y matches only at the beginning or end of a word
871 \Y matches only at a point that is not the beginning or end of a word
872 \Z matches only at the end of the string (see Section 9.7.3.5 for how
875 A word is defined as in the specification of [[:<:]] and [[:>:]] above.
876 Constraint escapes are illegal within bracket expressions.
878 Table 9.23. Regular Expression Back References
880 \m (where m is a nonzero digit) a back reference to the m'th
882 \mnn (where m is a nonzero digit, and nn is some more digits, and the
883 decimal value mnn is not greater than the number of closing capturing
884 parentheses seen so far) a back reference to the mnn'th subexpression
888 There is an inherent ambiguity between octal character-entry escapes
889 and back references, which is resolved by the following heuristics, as
890 hinted at above. A leading zero always indicates an octal escape. A
891 single non-zero digit, not followed by another digit, is always taken
892 as a back reference. A multi-digit sequence not starting with a zero is
893 taken as a back reference if it comes after a suitable subexpression
894 (i.e., the number is in the legal range for a back reference), and
895 otherwise is taken as octal.
897 9.7.3.4. Regular Expression Metasyntax #
899 In addition to the main syntax described above, there are some special
900 forms and miscellaneous syntactic facilities available.
902 An RE can begin with one of two special director prefixes. If an RE
903 begins with ***:, the rest of the RE is taken as an ARE. (This normally
904 has no effect in PostgreSQL, since REs are assumed to be AREs; but it
905 does have an effect if ERE or BRE mode had been specified by the flags
906 parameter to a regex function.) If an RE begins with ***=, the rest of
907 the RE is taken to be a literal string, with all characters considered
910 An ARE can begin with embedded options: a sequence (?xyz) (where xyz is
911 one or more alphabetic characters) specifies options affecting the rest
912 of the RE. These options override any previously determined options —
913 in particular, they can override the case-sensitivity behavior implied
914 by a regex operator, or the flags parameter to a regex function. The
915 available option letters are shown in Table 9.24. Note that these same
916 option letters are used in the flags parameters of regex functions.
918 Table 9.24. ARE Embedded-Option Letters
920 b rest of RE is a BRE
921 c case-sensitive matching (overrides operator type)
922 e rest of RE is an ERE
923 i case-insensitive matching (see Section 9.7.3.5) (overrides operator
925 m historical synonym for n
926 n newline-sensitive matching (see Section 9.7.3.5)
927 p partial newline-sensitive matching (see Section 9.7.3.5)
928 q rest of RE is a literal (“quoted”) string, all ordinary characters
929 s non-newline-sensitive matching (default)
930 t tight syntax (default; see below)
931 w inverse partial newline-sensitive (“weird”) matching (see
933 x expanded syntax (see below)
935 Embedded options take effect at the ) terminating the sequence. They
936 can appear only at the start of an ARE (after the ***: director if
939 In addition to the usual (tight) RE syntax, in which all characters are
940 significant, there is an expanded syntax, available by specifying the
941 embedded x option. In the expanded syntax, white-space characters in
942 the RE are ignored, as are all characters between a # and the following
943 newline (or the end of the RE). This permits paragraphing and
944 commenting a complex RE. There are three exceptions to that basic rule:
945 * a white-space character or # preceded by \ is retained
946 * white space or # within a bracket expression is retained
947 * white space and comments cannot appear within multi-character
950 For this purpose, white-space characters are blank, tab, newline, and
951 any character that belongs to the space character class.
953 Finally, in an ARE, outside bracket expressions, the sequence (?#ttt)
954 (where ttt is any text not containing a )) is a comment, completely
955 ignored. Again, this is not allowed between the characters of
956 multi-character symbols, like (?:. Such comments are more a historical
957 artifact than a useful facility, and their use is deprecated; use the
958 expanded syntax instead.
960 None of these metasyntax extensions is available if an initial ***=
961 director has specified that the user's input be treated as a literal
962 string rather than as an RE.
964 9.7.3.5. Regular Expression Matching Rules #
966 In the event that an RE could match more than one substring of a given
967 string, the RE matches the one starting earliest in the string. If the
968 RE could match more than one substring starting at that point, either
969 the longest possible match or the shortest possible match will be
970 taken, depending on whether the RE is greedy or non-greedy.
972 Whether an RE is greedy or not is determined by the following rules:
973 * Most atoms, and all constraints, have no greediness attribute
974 (because they cannot match variable amounts of text anyway).
975 * Adding parentheses around an RE does not change its greediness.
976 * A quantified atom with a fixed-repetition quantifier ({m} or {m}?)
977 has the same greediness (possibly none) as the atom itself.
978 * A quantified atom with other normal quantifiers (including {m,n}
979 with m equal to n) is greedy (prefers longest match).
980 * A quantified atom with a non-greedy quantifier (including {m,n}?
981 with m equal to n) is non-greedy (prefers shortest match).
982 * A branch — that is, an RE that has no top-level | operator — has
983 the same greediness as the first quantified atom in it that has a
984 greediness attribute.
985 * An RE consisting of two or more branches connected by the |
986 operator is always greedy.
988 The above rules associate greediness attributes not only with
989 individual quantified atoms, but with branches and entire REs that
990 contain quantified atoms. What that means is that the matching is done
991 in such a way that the branch, or whole RE, matches the longest or
992 shortest possible substring as a whole. Once the length of the entire
993 match is determined, the part of it that matches any particular
994 subexpression is determined on the basis of the greediness attribute of
995 that subexpression, with subexpressions starting earlier in the RE
996 taking priority over ones starting later.
998 An example of what this means:
999 SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})');
1001 SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
1004 In the first case, the RE as a whole is greedy because Y* is greedy. It
1005 can match beginning at the Y, and it matches the longest possible
1006 string starting there, i.e., Y123. The output is the parenthesized part
1007 of that, or 123. In the second case, the RE as a whole is non-greedy
1008 because Y*? is non-greedy. It can match beginning at the Y, and it
1009 matches the shortest possible string starting there, i.e., Y1. The
1010 subexpression [0-9]{1,3} is greedy but it cannot change the decision as
1011 to the overall match length; so it is forced to match just 1.
1013 In short, when an RE contains both greedy and non-greedy
1014 subexpressions, the total match length is either as long as possible or
1015 as short as possible, according to the attribute assigned to the whole
1016 RE. The attributes assigned to the subexpressions only affect how much
1017 of that match they are allowed to “eat” relative to each other.
1019 The quantifiers {1,1} and {1,1}? can be used to force greediness or
1020 non-greediness, respectively, on a subexpression or a whole RE. This is
1021 useful when you need the whole RE to have a greediness attribute
1022 different from what's deduced from its elements. As an example, suppose
1023 that we are trying to separate a string containing some digits into the
1024 digits and the parts before and after them. We might try to do that
1026 SELECT regexp_match('abc01234xyz', '(.*)(\d+)(.*)');
1027 Result: {abc0123,4,xyz}
1029 That didn't work: the first .* is greedy so it “eats” as much as it
1030 can, leaving the \d+ to match at the last possible place, the last
1031 digit. We might try to fix that by making it non-greedy:
1032 SELECT regexp_match('abc01234xyz', '(.*?)(\d+)(.*)');
1035 That didn't work either, because now the RE as a whole is non-greedy
1036 and so it ends the overall match as soon as possible. We can get what
1037 we want by forcing the RE as a whole to be greedy:
1038 SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
1039 Result: {abc,01234,xyz}
1041 Controlling the RE's overall greediness separately from its components'
1042 greediness allows great flexibility in handling variable-length
1045 When deciding what is a longer or shorter match, match lengths are
1046 measured in characters, not collating elements. An empty string is
1047 considered longer than no match at all. For example: bb* matches the
1048 three middle characters of abbbc; (week|wee)(night|knights) matches all
1049 ten characters of weeknights; when (.*).* is matched against abc the
1050 parenthesized subexpression matches all three characters; and when
1051 (a*)* is matched against bc both the whole RE and the parenthesized
1052 subexpression match an empty string.
1054 If case-independent matching is specified, the effect is much as if all
1055 case distinctions had vanished from the alphabet. When an alphabetic
1056 that exists in multiple cases appears as an ordinary character outside
1057 a bracket expression, it is effectively transformed into a bracket
1058 expression containing both cases, e.g., x becomes [xX]. When it appears
1059 inside a bracket expression, all case counterparts of it are added to
1060 the bracket expression, e.g., [x] becomes [xX] and [^x] becomes [^xX].
1062 If newline-sensitive matching is specified, . and bracket expressions
1063 using ^ will never match the newline character (so that matches will
1064 not cross lines unless the RE explicitly includes a newline) and ^ and
1065 $ will match the empty string after and before a newline respectively,
1066 in addition to matching at beginning and end of string respectively.
1067 But the ARE escapes \A and \Z continue to match beginning or end of
1068 string only. Also, the character class shorthands \D and \W will match
1069 a newline regardless of this mode. (Before PostgreSQL 14, they did not
1070 match newlines when in newline-sensitive mode. Write [^[:digit:]] or
1071 [^[:word:]] to get the old behavior.)
1073 If partial newline-sensitive matching is specified, this affects . and
1074 bracket expressions as with newline-sensitive matching, but not ^ and
1077 If inverse partial newline-sensitive matching is specified, this
1078 affects ^ and $ as with newline-sensitive matching, but not . and
1079 bracket expressions. This isn't very useful but is provided for
1082 9.7.3.6. Limits and Compatibility #
1084 No particular limit is imposed on the length of REs in this
1085 implementation. However, programs intended to be highly portable should
1086 not employ REs longer than 256 bytes, as a POSIX-compliant
1087 implementation can refuse to accept such REs.
1089 The only feature of AREs that is actually incompatible with POSIX EREs
1090 is that \ does not lose its special significance inside bracket
1091 expressions. All other ARE features use syntax which is illegal or has
1092 undefined or unspecified effects in POSIX EREs; the *** syntax of
1093 directors likewise is outside the POSIX syntax for both BREs and EREs.
1095 Many of the ARE extensions are borrowed from Perl, but some have been
1096 changed to clean them up, and a few Perl extensions are not present.
1097 Incompatibilities of note include \b, \B, the lack of special treatment
1098 for a trailing newline, the addition of complemented bracket
1099 expressions to the things affected by newline-sensitive matching, the
1100 restrictions on parentheses and back references in lookahead/lookbehind
1101 constraints, and the longest/shortest-match (rather than first-match)
1104 9.7.3.7. Basic Regular Expressions #
1106 BREs differ from EREs in several respects. In BREs, |, +, and ? are
1107 ordinary characters and there is no equivalent for their functionality.
1108 The delimiters for bounds are \{ and \}, with { and } by themselves
1109 ordinary characters. The parentheses for nested subexpressions are \(
1110 and \), with ( and ) by themselves ordinary characters. ^ is an
1111 ordinary character except at the beginning of the RE or the beginning
1112 of a parenthesized subexpression, $ is an ordinary character except at
1113 the end of the RE or the end of a parenthesized subexpression, and * is
1114 an ordinary character if it appears at the beginning of the RE or the
1115 beginning of a parenthesized subexpression (after a possible leading
1116 ^). Finally, single-digit back references are available, and \< and \>
1117 are synonyms for [[:<:]] and [[:>:]] respectively; no other escapes are
1120 9.7.3.8. Differences from SQL Standard and XQuery #
1122 Since SQL:2008, the SQL standard includes regular expression operators
1123 and functions that performs pattern matching according to the XQuery
1124 regular expression standard:
1131 PostgreSQL does not currently implement these operators and functions.
1132 You can get approximately equivalent functionality in each case as
1133 shown in Table 9.25. (Various optional clauses on both sides have been
1134 omitted in this table.)
1136 Table 9.25. Regular Expression Functions Equivalencies
1137 SQL standard PostgreSQL
1138 string LIKE_REGEX pattern regexp_like(string, pattern) or string ~
1140 OCCURRENCES_REGEX(pattern IN string) regexp_count(string, pattern)
1141 POSITION_REGEX(pattern IN string) regexp_instr(string, pattern)
1142 SUBSTRING_REGEX(pattern IN string) regexp_substr(string, pattern)
1143 TRANSLATE_REGEX(pattern IN string WITH replacement)
1144 regexp_replace(string, pattern, replacement)
1146 Regular expression functions similar to those provided by PostgreSQL
1147 are also available in a number of other SQL implementations, whereas
1148 the SQL-standard functions are not as widely implemented. Some of the
1149 details of the regular expression syntax will likely differ in each
1152 The SQL-standard operators and functions use XQuery regular
1153 expressions, which are quite close to the ARE syntax described above.
1154 Notable differences between the existing POSIX-based regular-expression
1155 feature and XQuery regular expressions include:
1156 * XQuery character class subtraction is not supported. An example of
1157 this feature is using the following to match only English
1158 consonants: [a-z-[aeiou]].
1159 * XQuery character class shorthands \c, \C, \i, and \I are not
1161 * XQuery character class elements using \p{UnicodeProperty} or the
1162 inverse \P{UnicodeProperty} are not supported.
1163 * POSIX interprets character classes such as \w (see Table 9.21)
1164 according to the prevailing locale (which you can control by
1165 attaching a COLLATE clause to the operator or function). XQuery
1166 specifies these classes by reference to Unicode character
1167 properties, so equivalent behavior is obtained only with a locale
1168 that follows the Unicode rules.
1169 * The SQL standard (not XQuery itself) attempts to cater for more
1170 variants of “newline” than POSIX does. The newline-sensitive
1171 matching options described above consider only ASCII NL (\n) to be
1172 a newline, but SQL would have us treat CR (\r), CRLF (\r\n) (a
1173 Windows-style newline), and some Unicode-only characters like LINE
1174 SEPARATOR (U+2028) as newlines as well. Notably, . and \s should
1175 count \r\n as one character not two according to SQL.
1176 * Of the character-entry escapes described in Table 9.20, XQuery
1177 supports only \n, \r, and \t.
1178 * XQuery does not support the [:name:] syntax for character classes
1179 within bracket expressions.
1180 * XQuery does not have lookahead or lookbehind constraints, nor any
1181 of the constraint escapes described in Table 9.22.
1182 * The metasyntax forms described in Section 9.7.3.4 do not exist in
1184 * The regular expression flag letters defined by XQuery are related
1185 to but not the same as the option letters for POSIX (Table 9.24).
1186 While the i and q options behave the same, others do not:
1187 + XQuery's s (allow dot to match newline) and m (allow ^ and $
1188 to match at newlines) flags provide access to the same
1189 behaviors as POSIX's n, p and w flags, but they do not match
1190 the behavior of POSIX's s and m flags. Note in particular that
1191 dot-matches-newline is the default behavior in POSIX but not
1193 + XQuery's x (ignore whitespace in pattern) flag is noticeably
1194 different from POSIX's expanded-mode flag. POSIX's x flag also
1195 allows # to begin a comment in the pattern, and POSIX will not
1196 ignore a whitespace character after a backslash.