begriffs open source - ai-pg/blob - full-docs/txt/functions-matching.txt

   1
   2 9.7. Pattern Matching #
   3
   4    9.7.1. LIKE
   5    9.7.2. SIMILAR TO Regular Expressions
   6    9.7.3. POSIX Regular Expressions
   7
   8    There are three separate approaches to pattern matching provided by
   9    PostgreSQL: the traditional SQL LIKE operator, the more recent SIMILAR
  10    TO operator (added in SQL:1999), and POSIX-style regular expressions.
  11    Aside from the basic “does this string match this pattern?” operators,
  12    functions are available to extract or replace matching substrings and
  13    to split a string at matching locations.
  14
  15 Tip
  16
  17    If you have pattern matching needs that go beyond this, consider
  18    writing a user-defined function in Perl or Tcl.
  19
  20 Caution
  21
  22    While most regular-expression searches can be executed very quickly,
  23    regular expressions can be contrived that take arbitrary amounts of
  24    time and memory to process. Be wary of accepting regular-expression
  25    search patterns from hostile sources. If you must do so, it is
  26    advisable to impose a statement timeout.
  27
  28    Searches using SIMILAR TO patterns have the same security hazards,
  29    since SIMILAR TO provides many of the same capabilities as POSIX-style
  30    regular expressions.
  31
  32    LIKE searches, being much simpler than the other two options, are safer
  33    to use with possibly-hostile pattern sources.
  34
  35    SIMILAR TO and POSIX-style regular expressions do not support
  36    nondeterministic collations. If required, use LIKE or apply a different
  37    collation to the expression to work around this limitation.
  38
  39 9.7.1. LIKE #
  40
  41 string LIKE pattern [ESCAPE escape-character]
  42 string NOT LIKE pattern [ESCAPE escape-character]
  43
  44    The LIKE expression returns true if the string matches the supplied
  45    pattern. (As expected, the NOT LIKE expression returns false if LIKE
  46    returns true, and vice versa. An equivalent expression is NOT (string
  47    LIKE pattern).)
  48
  49    If pattern does not contain percent signs or underscores, then the
  50    pattern only represents the string itself; in that case LIKE acts like
  51    the equals operator. An underscore (_) in pattern stands for (matches)
  52    any single character; a percent sign (%) matches any sequence of zero
  53    or more characters.
  54
  55    Some examples:
  56 'abc' LIKE 'abc'    true
  57 'abc' LIKE 'a%'     true
  58 'abc' LIKE '_b_'    true
  59 'abc' LIKE 'c'      false
  60
  61    LIKE pattern matching supports nondeterministic collations (see
  62    Section 23.2.2.4), such as case-insensitive collations or collations
  63    that, say, ignore punctuation. So with a case-insensitive collation,
  64    one could have:
  65 'AbC' LIKE 'abc' COLLATE case_insensitive    true
  66 'AbC' LIKE 'a%' COLLATE case_insensitive     true
  67
  68    With collations that ignore certain characters or in general that
  69    consider strings of different lengths equal, the semantics can become a
  70    bit more complicated. Consider these examples:
  71 '.foo.' LIKE 'foo' COLLATE ign_punct    true
  72 '.foo.' LIKE 'f_o' COLLATE ign_punct    true
  73 '.foo.' LIKE '_oo' COLLATE ign_punct    false
  74
  75    The way the matching works is that the pattern is partitioned into
  76    sequences of wildcards and non-wildcard strings (wildcards being _ and
  77    %). For example, the pattern f_o is partitioned into f, _, o, the
  78    pattern _oo is partitioned into _, oo. The input string matches the
  79    pattern if it can be partitioned in such a way that the wildcards match
  80    one character or any number of characters respectively and the
  81    non-wildcard partitions are equal under the applicable collation. So
  82    for example, '.foo.' LIKE 'f_o' COLLATE ign_punct is true because one
  83    can partition .foo. into .f, o, o., and then '.f' = 'f' COLLATE
  84    ign_punct, 'o' matches the _ wildcard, and 'o.' = 'o' COLLATE
  85    ign_punct. But '.foo.' LIKE '_oo' COLLATE ign_punct is false because
  86    .foo. cannot be partitioned in a way that the first character is any
  87    character and the rest of the string compares equal to oo. (Note that
  88    the single-character wildcard always matches exactly one character,
  89    independent of the collation. So in this example, the _ would match .,
  90    but then the rest of the input string won't match the rest of the
  91    pattern.)
  92
  93    LIKE pattern matching always covers the entire string. Therefore, if
  94    it's desired to match a sequence anywhere within a string, the pattern
  95    must start and end with a percent sign.
  96
  97    To match a literal underscore or percent sign without matching other
  98    characters, the respective character in pattern must be preceded by the
  99    escape character. The default escape character is the backslash but a
 100    different one can be selected by using the ESCAPE clause. To match the
 101    escape character itself, write two escape characters.
 102
 103 Note
 104
 105    If you have standard_conforming_strings turned off, any backslashes you
 106    write in literal string constants will need to be doubled. See
 107    Section 4.1.2.1 for more information.
 108
 109    It's also possible to select no escape character by writing ESCAPE ''.
 110    This effectively disables the escape mechanism, which makes it
 111    impossible to turn off the special meaning of underscore and percent
 112    signs in the pattern.
 113
 114    According to the SQL standard, omitting ESCAPE means there is no escape
 115    character (rather than defaulting to a backslash), and a zero-length
 116    ESCAPE value is disallowed. PostgreSQL's behavior in this regard is
 117    therefore slightly nonstandard.
 118
 119    The key word ILIKE can be used instead of LIKE to make the match
 120    case-insensitive according to the active locale. (But this does not
 121    support nondeterministic collations.) This is not in the SQL standard
 122    but is a PostgreSQL extension.
 123
 124    The operator ~~ is equivalent to LIKE, and ~~* corresponds to ILIKE.
 125    There are also !~~ and !~~* operators that represent NOT LIKE and NOT
 126    ILIKE, respectively. All of these operators are PostgreSQL-specific.
 127    You may see these operator names in EXPLAIN output and similar places,
 128    since the parser actually translates LIKE et al. to these operators.
 129
 130    The phrases LIKE, ILIKE, NOT LIKE, and NOT ILIKE are generally treated
 131    as operators in PostgreSQL syntax; for example they can be used in
 132    expression operator ANY (subquery) constructs, although an ESCAPE
 133    clause cannot be included there. In some obscure cases it may be
 134    necessary to use the underlying operator names instead.
 135
 136    Also see the starts-with operator ^@ and the corresponding
 137    starts_with() function, which are useful in cases where simply matching
 138    the beginning of a string is needed.
 139
 140 9.7.2. SIMILAR TO Regular Expressions #
 141
 142 string SIMILAR TO pattern [ESCAPE escape-character]
 143 string NOT SIMILAR TO pattern [ESCAPE escape-character]
 144
 145    The SIMILAR TO operator returns true or false depending on whether its
 146    pattern matches the given string. It is similar to LIKE, except that it
 147    interprets the pattern using the SQL standard's definition of a regular
 148    expression. SQL regular expressions are a curious cross between LIKE
 149    notation and common (POSIX) regular expression notation.
 150
 151    Like LIKE, the SIMILAR TO operator succeeds only if its pattern matches
 152    the entire string; this is unlike common regular expression behavior
 153    where the pattern can match any part of the string. Also like LIKE,
 154    SIMILAR TO uses _ and % as wildcard characters denoting any single
 155    character and any string, respectively (these are comparable to . and
 156    .* in POSIX regular expressions).
 157
 158    In addition to these facilities borrowed from LIKE, SIMILAR TO supports
 159    these pattern-matching metacharacters borrowed from POSIX regular
 160    expressions:
 161      * | denotes alternation (either of two alternatives).
 162      * * denotes repetition of the previous item zero or more times.
 163      * + denotes repetition of the previous item one or more times.
 164      * ? denotes repetition of the previous item zero or one time.
 165      * {m} denotes repetition of the previous item exactly m times.
 166      * {m,} denotes repetition of the previous item m or more times.
 167      * {m,n} denotes repetition of the previous item at least m and not
 168        more than n times.
 169      * Parentheses () can be used to group items into a single logical
 170        item.
 171      * A bracket expression [...] specifies a character class, just as in
 172        POSIX regular expressions.
 173
 174    Notice that the period (.) is not a metacharacter for SIMILAR TO.
 175
 176    As with LIKE, a backslash disables the special meaning of any of these
 177    metacharacters. A different escape character can be specified with
 178    ESCAPE, or the escape capability can be disabled by writing ESCAPE ''.
 179
 180    According to the SQL standard, omitting ESCAPE means there is no escape
 181    character (rather than defaulting to a backslash), and a zero-length
 182    ESCAPE value is disallowed. PostgreSQL's behavior in this regard is
 183    therefore slightly nonstandard.
 184
 185    Another nonstandard extension is that following the escape character
 186    with a letter or digit provides access to the escape sequences defined
 187    for POSIX regular expressions; see Table 9.20, Table 9.21, and
 188    Table 9.22 below.
 189
 190    Some examples:
 191 'abc' SIMILAR TO 'abc'          true
 192 'abc' SIMILAR TO 'a'            false
 193 'abc' SIMILAR TO '%(b|d)%'      true
 194 'abc' SIMILAR TO '(b|c)%'       false
 195 '-abc-' SIMILAR TO '%\mabc\M%'  true
 196 'xabcy' SIMILAR TO '%\mabc\M%'  false
 197
 198    The substring function with three parameters provides extraction of a
 199    substring that matches an SQL regular expression pattern. The function
 200    can be written according to standard SQL syntax:
 201 substring(string similar pattern escape escape-character)
 202
 203    or using the now obsolete SQL:1999 syntax:
 204 substring(string from pattern for escape-character)
 205
 206    or as a plain three-argument function:
 207 substring(string, pattern, escape-character)
 208
 209    As with SIMILAR TO, the specified pattern must match the entire data
 210    string, or else the function fails and returns null. To indicate the
 211    part of the pattern for which the matching data sub-string is of
 212    interest, the pattern should contain two occurrences of the escape
 213    character followed by a double quote ("). The text matching the portion
 214    of the pattern between these separators is returned when the match is
 215    successful.
 216
 217    The escape-double-quote separators actually divide substring's pattern
 218    into three independent regular expressions; for example, a vertical bar
 219    (|) in any of the three sections affects only that section. Also, the
 220    first and third of these regular expressions are defined to match the
 221    smallest possible amount of text, not the largest, when there is any
 222    ambiguity about how much of the data string matches which pattern. (In
 223    POSIX parlance, the first and third regular expressions are forced to
 224    be non-greedy.)
 225
 226    As an extension to the SQL standard, PostgreSQL allows there to be just
 227    one escape-double-quote separator, in which case the third regular
 228    expression is taken as empty; or no separators, in which case the first
 229    and third regular expressions are taken as empty.
 230
 231    Some examples, with #" delimiting the return string:
 232 substring('foobar' similar '%#"o_b#"%' escape '#')   oob
 233 substring('foobar' similar '#"o_b#"%' escape '#')    NULL
 234
 235 9.7.3. POSIX Regular Expressions #
 236
 237    Table 9.16 lists the available operators for pattern matching using
 238    POSIX regular expressions.
 239
 240    Table 9.16. Regular Expression Match Operators
 241
 242    Operator
 243
 244    Description
 245
 246    Example(s)
 247
 248    text ~ text → boolean
 249
 250    String matches regular expression, case sensitively
 251
 252    'thomas' ~ 't.*ma' → t
 253
 254    text ~* text → boolean
 255
 256    String matches regular expression, case-insensitively
 257
 258    'thomas' ~* 'T.*ma' → t
 259
 260    text !~ text → boolean
 261
 262    String does not match regular expression, case sensitively
 263
 264    'thomas' !~ 't.*max' → t
 265
 266    text !~* text → boolean
 267
 268    String does not match regular expression, case-insensitively
 269
 270    'thomas' !~* 'T.*ma' → f
 271
 272    POSIX regular expressions provide a more powerful means for pattern
 273    matching than the LIKE and SIMILAR TO operators. Many Unix tools such
 274    as egrep, sed, or awk use a pattern matching language that is similar
 275    to the one described here.
 276
 277    A regular expression is a character sequence that is an abbreviated
 278    definition of a set of strings (a regular set). A string is said to
 279    match a regular expression if it is a member of the regular set
 280    described by the regular expression. As with LIKE, pattern characters
 281    match string characters exactly unless they are special characters in
 282    the regular expression language — but regular expressions use different
 283    special characters than LIKE does. Unlike LIKE patterns, a regular
 284    expression is allowed to match anywhere within a string, unless the
 285    regular expression is explicitly anchored to the beginning or end of
 286    the string.
 287
 288    Some examples:
 289 'abcd' ~ 'bc'     true
 290 'abcd' ~ 'a.c'    true — dot matches any character
 291 'abcd' ~ 'a.*d'   true — * repeats the preceding pattern item
 292 'abcd' ~ '(b|x)'  true — | means OR, parentheses group
 293 'abcd' ~ '^a'     true — ^ anchors to start of string
 294 'abcd' ~ '^(b|c)' false — would match except for anchoring
 295
 296    The POSIX pattern language is described in much greater detail below.
 297
 298    The substring function with two parameters, substring(string from
 299    pattern), provides extraction of a substring that matches a POSIX
 300    regular expression pattern. It returns null if there is no match,
 301    otherwise the first portion of the text that matched the pattern. But
 302    if the pattern contains any parentheses, the portion of the text that
 303    matched the first parenthesized subexpression (the one whose left
 304    parenthesis comes first) is returned. You can put parentheses around
 305    the whole expression if you want to use parentheses within it without
 306    triggering this exception. If you need parentheses in the pattern
 307    before the subexpression you want to extract, see the non-capturing
 308    parentheses described below.
 309
 310    Some examples:
 311 substring('foobar' from 'o.b')     oob
 312 substring('foobar' from 'o(.)b')   o
 313
 314    The regexp_count function counts the number of places where a POSIX
 315    regular expression pattern matches a string. It has the syntax
 316    regexp_count(string, pattern [, start [, flags ]]). pattern is searched
 317    for in string, normally from the beginning of the string, but if the
 318    start parameter is provided then beginning from that character index.
 319    The flags parameter is an optional text string containing zero or more
 320    single-letter flags that change the function's behavior. For example,
 321    including i in flags specifies case-insensitive matching. Supported
 322    flags are described in Table 9.24.
 323
 324    Some examples:
 325 regexp_count('ABCABCAXYaxy', 'A.')          3
 326 regexp_count('ABCABCAXYaxy', 'A.', 1, 'i')  4
 327
 328    The regexp_instr function returns the starting or ending position of
 329    the N'th match of a POSIX regular expression pattern to a string, or
 330    zero if there is no such match. It has the syntax regexp_instr(string,
 331    pattern [, start [, N [, endoption [, flags [, subexpr ]]]]]). pattern
 332    is searched for in string, normally from the beginning of the string,
 333    but if the start parameter is provided then beginning from that
 334    character index. If N is specified then the N'th match of the pattern
 335    is located, otherwise the first match is located. If the endoption
 336    parameter is omitted or specified as zero, the function returns the
 337    position of the first character of the match. Otherwise, endoption must
 338    be one, and the function returns the position of the character
 339    following the match. The flags parameter is an optional text string
 340    containing zero or more single-letter flags that change the function's
 341    behavior. Supported flags are described in Table 9.24. For a pattern
 342    containing parenthesized subexpressions, subexpr is an integer
 343    indicating which subexpression is of interest: the result identifies
 344    the position of the substring matching that subexpression.
 345    Subexpressions are numbered in the order of their leading parentheses.
 346    When subexpr is omitted or zero, the result identifies the position of
 347    the whole match regardless of parenthesized subexpressions.
 348
 349    Some examples:
 350 regexp_instr('number of your street, town zip, FR', '[^,]+', 1, 2)
 351                                    23
 352 regexp_instr(string=>'ABCDEFGHI', pattern=>'(c..)(...)', start=>1, "N"=>1, endop
 353 tion=>0, flags=>'i', subexpr=>2)
 354                                    6
 355
 356    The regexp_like function checks whether a match of a POSIX regular
 357    expression pattern occurs within a string, returning boolean true or
 358    false. It has the syntax regexp_like(string, pattern [, flags ]). The
 359    flags parameter is an optional text string containing zero or more
 360    single-letter flags that change the function's behavior. Supported
 361    flags are described in Table 9.24. This function has the same results
 362    as the ~ operator if no flags are specified. If only the i flag is
 363    specified, it has the same results as the ~* operator.
 364
 365    Some examples:
 366 regexp_like('Hello World', 'world')       false
 367 regexp_like('Hello World', 'world', 'i')  true
 368
 369    The regexp_match function returns a text array of matching substring(s)
 370    within the first match of a POSIX regular expression pattern to a
 371    string. It has the syntax regexp_match(string, pattern [, flags ]). If
 372    there is no match, the result is NULL. If a match is found, and the
 373    pattern contains no parenthesized subexpressions, then the result is a
 374    single-element text array containing the substring matching the whole
 375    pattern. If a match is found, and the pattern contains parenthesized
 376    subexpressions, then the result is a text array whose n'th element is
 377    the substring matching the n'th parenthesized subexpression of the
 378    pattern (not counting “non-capturing” parentheses; see below for
 379    details). The flags parameter is an optional text string containing
 380    zero or more single-letter flags that change the function's behavior.
 381    Supported flags are described in Table 9.24.
 382
 383    Some examples:
 384 SELECT regexp_match('foobarbequebaz', 'bar.*que');
 385  regexp_match
 386 --------------
 387  {barbeque}
 388 (1 row)
 389
 390 SELECT regexp_match('foobarbequebaz', '(bar)(beque)');
 391  regexp_match
 392 --------------
 393  {bar,beque}
 394 (1 row)
 395
 396 Tip
 397
 398    In the common case where you just want the whole matching substring or
 399    NULL for no match, the best solution is to use regexp_substr().
 400    However, regexp_substr() only exists in PostgreSQL version 15 and up.
 401    When working in older versions, you can extract the first element of
 402    regexp_match()'s result, for example:
 403 SELECT (regexp_match('foobarbequebaz', 'bar.*que'))[1];
 404  regexp_match
 405 --------------
 406  barbeque
 407 (1 row)
 408
 409    The regexp_matches function returns a set of text arrays of matching
 410    substring(s) within matches of a POSIX regular expression pattern to a
 411    string. It has the same syntax as regexp_match. This function returns
 412    no rows if there is no match, one row if there is a match and the g
 413    flag is not given, or N rows if there are N matches and the g flag is
 414    given. Each returned row is a text array containing the whole matched
 415    substring or the substrings matching parenthesized subexpressions of
 416    the pattern, just as described above for regexp_match. regexp_matches
 417    accepts all the flags shown in Table 9.24, plus the g flag which
 418    commands it to return all matches, not just the first one.
 419
 420    Some examples:
 421 SELECT regexp_matches('foo', 'not there');
 422  regexp_matches
 423 ----------------
 424 (0 rows)
 425
 426 SELECT regexp_matches('foobarbequebazilbarfbonk', '(b[^b]+)(b[^b]+)', 'g');
 427  regexp_matches
 428 ----------------
 429  {bar,beque}
 430  {bazil,barf}
 431 (2 rows)
 432
 433 Tip
 434
 435    In most cases regexp_matches() should be used with the g flag, since if
 436    you only want the first match, it's easier and more efficient to use
 437    regexp_match(). However, regexp_match() only exists in PostgreSQL
 438    version 10 and up. When working in older versions, a common trick is to
 439    place a regexp_matches() call in a sub-select, for example:
 440 SELECT col1, (SELECT regexp_matches(col2, '(bar)(beque)')) FROM tab;
 441
 442    This produces a text array if there's a match, or NULL if not, the same
 443    as regexp_match() would do. Without the sub-select, this query would
 444    produce no output at all for table rows without a match, which is
 445    typically not the desired behavior.
 446
 447    The regexp_replace function provides substitution of new text for
 448    substrings that match POSIX regular expression patterns. It has the
 449    syntax regexp_replace(string, pattern, replacement [, flags ]) or
 450    regexp_replace(string, pattern, replacement, start [, N [, flags ]]).
 451    The source string is returned unchanged if there is no match to the
 452    pattern. If there is a match, the string is returned with the
 453    replacement string substituted for the matching substring. The
 454    replacement string can contain \n, where n is 1 through 9, to indicate
 455    that the source substring matching the n'th parenthesized subexpression
 456    of the pattern should be inserted, and it can contain \& to indicate
 457    that the substring matching the entire pattern should be inserted.
 458    Write \\ if you need to put a literal backslash in the replacement
 459    text. pattern is searched for in string, normally from the beginning of
 460    the string, but if the start parameter is provided then beginning from
 461    that character index. By default, only the first match of the pattern
 462    is replaced. If N is specified and is greater than zero, then the N'th
 463    match of the pattern is replaced. If the g flag is given, or if N is
 464    specified and is zero, then all matches at or after the start position
 465    are replaced. (The g flag is ignored when N is specified.) The flags
 466    parameter is an optional text string containing zero or more
 467    single-letter flags that change the function's behavior. Supported
 468    flags (though not g) are described in Table 9.24.
 469
 470    Some examples:
 471 regexp_replace('foobarbaz', 'b..', 'X')
 472                                    fooXbaz
 473 regexp_replace('foobarbaz', 'b..', 'X', 'g')
 474                                    fooXX
 475 regexp_replace('foobarbaz', 'b(..)', 'X\1Y', 'g')
 476                                    fooXarYXazY
 477 regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 0, 'i')
 478                                    X PXstgrXSQL fXnctXXn
 479 regexp_replace(string=>'A PostgreSQL function', pattern=>'a|e|i|o|u', replacemen
 480 t=>'X', start=>1, "N"=>3, flags=>'i')
 481                                    A PostgrXSQL function
 482
 483    The regexp_split_to_table function splits a string using a POSIX
 484    regular expression pattern as a delimiter. It has the syntax
 485    regexp_split_to_table(string, pattern [, flags ]). If there is no match
 486    to the pattern, the function returns the string. If there is at least
 487    one match, for each match it returns the text from the end of the last
 488    match (or the beginning of the string) to the beginning of the match.
 489    When there are no more matches, it returns the text from the end of the
 490    last match to the end of the string. The flags parameter is an optional
 491    text string containing zero or more single-letter flags that change the
 492    function's behavior. regexp_split_to_table supports the flags described
 493    in Table 9.24.
 494
 495    The regexp_split_to_array function behaves the same as
 496    regexp_split_to_table, except that regexp_split_to_array returns its
 497    result as an array of text. It has the syntax
 498    regexp_split_to_array(string, pattern [, flags ]). The parameters are
 499    the same as for regexp_split_to_table.
 500
 501    Some examples:
 502 SELECT foo FROM regexp_split_to_table('the quick brown fox jumps over the lazy d
 503 og', '\s+') AS foo;
 504   foo
 505 -------
 506  the
 507  quick
 508  brown
 509  fox
 510  jumps
 511  over
 512  the
 513  lazy
 514  dog
 515 (9 rows)
 516
 517 SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', '\s+
 518 ');
 519               regexp_split_to_array
 520 -----------------------------------------------
 521  {the,quick,brown,fox,jumps,over,the,lazy,dog}
 522 (1 row)
 523
 524 SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
 525  foo
 526 -----
 527  t
 528  h
 529  e
 530  q
 531  u
 532  i
 533  c
 534  k
 535  b
 536  r
 537  o
 538  w
 539  n
 540  f
 541  o
 542  x
 543 (16 rows)
 544
 545    As the last example demonstrates, the regexp split functions ignore
 546    zero-length matches that occur at the start or end of the string or
 547    immediately after a previous match. This is contrary to the strict
 548    definition of regexp matching that is implemented by the other regexp
 549    functions, but is usually the most convenient behavior in practice.
 550    Other software systems such as Perl use similar definitions.
 551
 552    The regexp_substr function returns the substring that matches a POSIX
 553    regular expression pattern, or NULL if there is no match. It has the
 554    syntax regexp_substr(string, pattern [, start [, N [, flags [, subexpr
 555    ]]]]). pattern is searched for in string, normally from the beginning
 556    of the string, but if the start parameter is provided then beginning
 557    from that character index. If N is specified then the N'th match of the
 558    pattern is returned, otherwise the first match is returned. The flags
 559    parameter is an optional text string containing zero or more
 560    single-letter flags that change the function's behavior. Supported
 561    flags are described in Table 9.24. For a pattern containing
 562    parenthesized subexpressions, subexpr is an integer indicating which
 563    subexpression is of interest: the result is the substring matching that
 564    subexpression. Subexpressions are numbered in the order of their
 565    leading parentheses. When subexpr is omitted or zero, the result is the
 566    whole match regardless of parenthesized subexpressions.
 567
 568    Some examples:
 569 regexp_substr('number of your street, town zip, FR', '[^,]+', 1, 2)
 570                                     town zip
 571 regexp_substr('ABCDEFGHI', '(c..)(...)', 1, 1, 'i', 2)
 572                                    FGH
 573
 574 9.7.3.1. Regular Expression Details #
 575
 576    PostgreSQL's regular expressions are implemented using a software
 577    package written by Henry Spencer. Much of the description of regular
 578    expressions below is copied verbatim from his manual.
 579
 580    Regular expressions (REs), as defined in POSIX 1003.2, come in two
 581    forms: extended REs or EREs (roughly those of egrep), and basic REs or
 582    BREs (roughly those of ed). PostgreSQL supports both forms, and also
 583    implements some extensions that are not in the POSIX standard, but have
 584    become widely used due to their availability in programming languages
 585    such as Perl and Tcl. REs using these non-POSIX extensions are called
 586    advanced REs or AREs in this documentation. AREs are almost an exact
 587    superset of EREs, but BREs have several notational incompatibilities
 588    (as well as being much more limited). We first describe the ARE and ERE
 589    forms, noting features that apply only to AREs, and then describe how
 590    BREs differ.
 591
 592 Note
 593
 594    PostgreSQL always initially presumes that a regular expression follows
 595    the ARE rules. However, the more limited ERE or BRE rules can be chosen
 596    by prepending an embedded option to the RE pattern, as described in
 597    Section 9.7.3.4. This can be useful for compatibility with applications
 598    that expect exactly the POSIX 1003.2 rules.
 599
 600    A regular expression is defined as one or more branches, separated by
 601    |. It matches anything that matches one of the branches.
 602
 603    A branch is zero or more quantified atoms or constraints, concatenated.
 604    It matches a match for the first, followed by a match for the second,
 605    etc.; an empty branch matches the empty string.
 606
 607    A quantified atom is an atom possibly followed by a single quantifier.
 608    Without a quantifier, it matches a match for the atom. With a
 609    quantifier, it can match some number of matches of the atom. An atom
 610    can be any of the possibilities shown in Table 9.17. The possible
 611    quantifiers and their meanings are shown in Table 9.18.
 612
 613    A constraint matches an empty string, but matches only when specific
 614    conditions are met. A constraint can be used where an atom could be
 615    used, except it cannot be followed by a quantifier. The simple
 616    constraints are shown in Table 9.19; some more constraints are
 617    described later.
 618
 619    Table 9.17. Regular Expression Atoms
 620    Atom Description
 621    (re) (where re is any regular expression) matches a match for re, with
 622    the match noted for possible reporting
 623    (?:re) as above, but the match is not noted for reporting (a
 624    “non-capturing” set of parentheses) (AREs only)
 625    . matches any single character
 626    [chars] a bracket expression, matching any one of the chars (see
 627    Section 9.7.3.2 for more detail)
 628    \k (where k is a non-alphanumeric character) matches that character
 629    taken as an ordinary character, e.g., \\ matches a backslash character
 630    \c where c is alphanumeric (possibly followed by other characters) is
 631    an escape, see Section 9.7.3.3 (AREs only; in EREs and BREs, this
 632    matches c)
 633    { when followed by a character other than a digit, matches the
 634    left-brace character {; when followed by a digit, it is the beginning
 635    of a bound (see below)
 636    x where x is a single character with no other significance, matches
 637    that character
 638
 639    An RE cannot end with a backslash (\).
 640
 641 Note
 642
 643    If you have standard_conforming_strings turned off, any backslashes you
 644    write in literal string constants will need to be doubled. See
 645    Section 4.1.2.1 for more information.
 646
 647    Table 9.18. Regular Expression Quantifiers
 648    Quantifier Matches
 649    * a sequence of 0 or more matches of the atom
 650    + a sequence of 1 or more matches of the atom
 651    ? a sequence of 0 or 1 matches of the atom
 652    {m} a sequence of exactly m matches of the atom
 653    {m,} a sequence of m or more matches of the atom
 654    {m,n} a sequence of m through n (inclusive) matches of the atom; m
 655    cannot exceed n
 656    *? non-greedy version of *
 657    +? non-greedy version of +
 658    ?? non-greedy version of ?
 659    {m}? non-greedy version of {m}
 660    {m,}? non-greedy version of {m,}
 661    {m,n}? non-greedy version of {m,n}
 662
 663    The forms using {...} are known as bounds. The numbers m and n within a
 664    bound are unsigned decimal integers with permissible values from 0 to
 665    255 inclusive.
 666
 667    Non-greedy quantifiers (available in AREs only) match the same
 668    possibilities as their corresponding normal (greedy) counterparts, but
 669    prefer the smallest number rather than the largest number of matches.
 670    See Section 9.7.3.5 for more detail.
 671
 672 Note
 673
 674    A quantifier cannot immediately follow another quantifier, e.g., ** is
 675    invalid. A quantifier cannot begin an expression or subexpression or
 676    follow ^ or |.
 677
 678    Table 9.19. Regular Expression Constraints
 679    Constraint Description
 680    ^ matches at the beginning of the string
 681    $ matches at the end of the string
 682    (?=re) positive lookahead matches at any point where a substring
 683    matching re begins (AREs only)
 684    (?!re) negative lookahead matches at any point where no substring
 685    matching re begins (AREs only)
 686    (?<=re) positive lookbehind matches at any point where a substring
 687    matching re ends (AREs only)
 688    (?<!re) negative lookbehind matches at any point where no substring
 689    matching re ends (AREs only)
 690
 691    Lookahead and lookbehind constraints cannot contain back references
 692    (see Section 9.7.3.3), and all parentheses within them are considered
 693    non-capturing.
 694
 695 9.7.3.2. Bracket Expressions #
 696
 697    A bracket expression is a list of characters enclosed in []. It
 698    normally matches any single character from the list (but see below). If
 699    the list begins with ^, it matches any single character not from the
 700    rest of the list. If two characters in the list are separated by -,
 701    this is shorthand for the full range of characters between those two
 702    (inclusive) in the collating sequence, e.g., [0-9] in ASCII matches any
 703    decimal digit. It is illegal for two ranges to share an endpoint, e.g.,
 704    a-c-e. Ranges are very collating-sequence-dependent, so portable
 705    programs should avoid relying on them.
 706
 707    To include a literal ] in the list, make it the first character (after
 708    ^, if that is used). To include a literal -, make it the first or last
 709    character, or the second endpoint of a range. To use a literal - as the
 710    first endpoint of a range, enclose it in [. and .] to make it a
 711    collating element (see below). With the exception of these characters,
 712    some combinations using [ (see next paragraphs), and escapes (AREs
 713    only), all other special characters lose their special significance
 714    within a bracket expression. In particular, \ is not special when
 715    following ERE or BRE rules, though it is special (as introducing an
 716    escape) in AREs.
 717
 718    Within a bracket expression, a collating element (a character, a
 719    multiple-character sequence that collates as if it were a single
 720    character, or a collating-sequence name for either) enclosed in [. and
 721    .] stands for the sequence of characters of that collating element. The
 722    sequence is treated as a single element of the bracket expression's
 723    list. This allows a bracket expression containing a multiple-character
 724    collating element to match more than one character, e.g., if the
 725    collating sequence includes a ch collating element, then the RE
 726    [[.ch.]]*c matches the first five characters of chchcc.
 727
 728 Note
 729
 730    PostgreSQL currently does not support multi-character collating
 731    elements. This information describes possible future behavior.
 732
 733    Within a bracket expression, a collating element enclosed in [= and =]
 734    is an equivalence class, standing for the sequences of characters of
 735    all collating elements equivalent to that one, including itself. (If
 736    there are no other equivalent collating elements, the treatment is as
 737    if the enclosing delimiters were [. and .].) For example, if o and ^
 738    are the members of an equivalence class, then [[=o=]], [[=^=]], and
 739    [o^] are all synonymous. An equivalence class cannot be an endpoint of
 740    a range.
 741
 742    Within a bracket expression, the name of a character class enclosed in
 743    [: and :] stands for the list of all characters belonging to that
 744    class. A character class cannot be used as an endpoint of a range. The
 745    POSIX standard defines these character class names: alnum (letters and
 746    numeric digits), alpha (letters), blank (space and tab), cntrl (control
 747    characters), digit (numeric digits), graph (printable characters except
 748    space), lower (lower-case letters), print (printable characters
 749    including space), punct (punctuation), space (any white space), upper
 750    (upper-case letters), and xdigit (hexadecimal digits). The behavior of
 751    these standard character classes is generally consistent across
 752    platforms for characters in the 7-bit ASCII set. Whether a given
 753    non-ASCII character is considered to belong to one of these classes
 754    depends on the collation that is used for the regular-expression
 755    function or operator (see Section 23.2), or by default on the
 756    database's LC_CTYPE locale setting (see Section 23.1). The
 757    classification of non-ASCII characters can vary across platforms even
 758    in similarly-named locales. (But the C locale never considers any
 759    non-ASCII characters to belong to any of these classes.) In addition to
 760    these standard character classes, PostgreSQL defines the word character
 761    class, which is the same as alnum plus the underscore (_) character,
 762    and the ascii character class, which contains exactly the 7-bit ASCII
 763    set.
 764
 765    There are two special cases of bracket expressions: the bracket
 766    expressions [[:<:]] and [[:>:]] are constraints, matching empty strings
 767    at the beginning and end of a word respectively. A word is defined as a
 768    sequence of word characters that is neither preceded nor followed by
 769    word characters. A word character is any character belonging to the
 770    word character class, that is, any letter, digit, or underscore. This
 771    is an extension, compatible with but not specified by POSIX 1003.2, and
 772    should be used with caution in software intended to be portable to
 773    other systems. The constraint escapes described below are usually
 774    preferable; they are no more standard, but are easier to type.
 775
 776 9.7.3.3. Regular Expression Escapes #
 777
 778    Escapes are special sequences beginning with \ followed by an
 779    alphanumeric character. Escapes come in several varieties: character
 780    entry, class shorthands, constraint escapes, and back references. A \
 781    followed by an alphanumeric character but not constituting a valid
 782    escape is illegal in AREs. In EREs, there are no escapes: outside a
 783    bracket expression, a \ followed by an alphanumeric character merely
 784    stands for that character as an ordinary character, and inside a
 785    bracket expression, \ is an ordinary character. (The latter is the one
 786    actual incompatibility between EREs and AREs.)
 787
 788    Character-entry escapes exist to make it easier to specify non-printing
 789    and other inconvenient characters in REs. They are shown in Table 9.20.
 790
 791    Class-shorthand escapes provide shorthands for certain commonly-used
 792    character classes. They are shown in Table 9.21.
 793
 794    A constraint escape is a constraint, matching the empty string if
 795    specific conditions are met, written as an escape. They are shown in
 796    Table 9.22.
 797
 798    A back reference (\n) matches the same string matched by the previous
 799    parenthesized subexpression specified by the number n (see Table 9.23).
 800    For example, ([bc])\1 matches bb or cc but not bc or cb. The
 801    subexpression must entirely precede the back reference in the RE.
 802    Subexpressions are numbered in the order of their leading parentheses.
 803    Non-capturing parentheses do not define subexpressions. The back
 804    reference considers only the string characters matched by the
 805    referenced subexpression, not any constraints contained in it. For
 806    example, (^\d)\1 will match 22.
 807
 808    Table 9.20. Regular Expression Character-Entry Escapes
 809    Escape Description
 810    \a alert (bell) character, as in C
 811    \b backspace, as in C
 812    \B synonym for backslash (\) to help reduce the need for backslash
 813    doubling
 814    \cX (where X is any character) the character whose low-order 5 bits are
 815    the same as those of X, and whose other bits are all zero
 816    \e the character whose collating-sequence name is ESC, or failing that,
 817    the character with octal value 033
 818    \f form feed, as in C
 819    \n newline, as in C
 820    \r carriage return, as in C
 821    \t horizontal tab, as in C
 822    \uwxyz (where wxyz is exactly four hexadecimal digits) the character
 823    whose hexadecimal value is 0xwxyz
 824    \Ustuvwxyz (where stuvwxyz is exactly eight hexadecimal digits) the
 825    character whose hexadecimal value is 0xstuvwxyz
 826    \v vertical tab, as in C
 827    \xhhh (where hhh is any sequence of hexadecimal digits) the character
 828    whose hexadecimal value is 0xhhh (a single character no matter how many
 829    hexadecimal digits are used)
 830    \0 the character whose value is 0 (the null byte)
 831    \xy (where xy is exactly two octal digits, and is not a back reference)
 832    the character whose octal value is 0xy
 833    \xyz (where xyz is exactly three octal digits, and is not a back
 834    reference) the character whose octal value is 0xyz
 835
 836    Hexadecimal digits are 0-9, a-f, and A-F. Octal digits are 0-7.
 837
 838    Numeric character-entry escapes specifying values outside the ASCII
 839    range (0–127) have meanings dependent on the database encoding. When
 840    the encoding is UTF-8, escape values are equivalent to Unicode code
 841    points, for example \u1234 means the character U+1234. For other
 842    multibyte encodings, character-entry escapes usually just specify the
 843    concatenation of the byte values for the character. If the escape value
 844    does not correspond to any legal character in the database encoding, no
 845    error will be raised, but it will never match any data.
 846
 847    The character-entry escapes are always taken as ordinary characters.
 848    For example, \135 is ] in ASCII, but \135 does not terminate a bracket
 849    expression.
 850
 851    Table 9.21. Regular Expression Class-Shorthand Escapes
 852    Escape                       Description
 853    \d     matches any digit, like [[:digit:]]
 854    \s     matches any whitespace character, like [[:space:]]
 855    \w     matches any word character, like [[:word:]]
 856    \D     matches any non-digit, like [^[:digit:]]
 857    \S     matches any non-whitespace character, like [^[:space:]]
 858    \W     matches any non-word character, like [^[:word:]]
 859
 860    The class-shorthand escapes also work within bracket expressions,
 861    although the definitions shown above are not quite syntactically valid
 862    in that context. For example, [a-c\d] is equivalent to [a-c[:digit:]].
 863
 864    Table 9.22. Regular Expression Constraint Escapes
 865    Escape Description
 866    \A matches only at the beginning of the string (see Section 9.7.3.5 for
 867    how this differs from ^)
 868    \m matches only at the beginning of a word
 869    \M matches only at the end of a word
 870    \y matches only at the beginning or end of a word
 871    \Y matches only at a point that is not the beginning or end of a word
 872    \Z matches only at the end of the string (see Section 9.7.3.5 for how
 873    this differs from $)
 874
 875    A word is defined as in the specification of [[:<:]] and [[:>:]] above.
 876    Constraint escapes are illegal within bracket expressions.
 877
 878    Table 9.23. Regular Expression Back References
 879    Escape Description
 880    \m (where m is a nonzero digit) a back reference to the m'th
 881    subexpression
 882    \mnn (where m is a nonzero digit, and nn is some more digits, and the
 883    decimal value mnn is not greater than the number of closing capturing
 884    parentheses seen so far) a back reference to the mnn'th subexpression
 885
 886 Note
 887
 888    There is an inherent ambiguity between octal character-entry escapes
 889    and back references, which is resolved by the following heuristics, as
 890    hinted at above. A leading zero always indicates an octal escape. A
 891    single non-zero digit, not followed by another digit, is always taken
 892    as a back reference. A multi-digit sequence not starting with a zero is
 893    taken as a back reference if it comes after a suitable subexpression
 894    (i.e., the number is in the legal range for a back reference), and
 895    otherwise is taken as octal.
 896
 897 9.7.3.4. Regular Expression Metasyntax #
 898
 899    In addition to the main syntax described above, there are some special
 900    forms and miscellaneous syntactic facilities available.
 901
 902    An RE can begin with one of two special director prefixes. If an RE
 903    begins with ***:, the rest of the RE is taken as an ARE. (This normally
 904    has no effect in PostgreSQL, since REs are assumed to be AREs; but it
 905    does have an effect if ERE or BRE mode had been specified by the flags
 906    parameter to a regex function.) If an RE begins with ***=, the rest of
 907    the RE is taken to be a literal string, with all characters considered
 908    ordinary characters.
 909
 910    An ARE can begin with embedded options: a sequence (?xyz) (where xyz is
 911    one or more alphabetic characters) specifies options affecting the rest
 912    of the RE. These options override any previously determined options —
 913    in particular, they can override the case-sensitivity behavior implied
 914    by a regex operator, or the flags parameter to a regex function. The
 915    available option letters are shown in Table 9.24. Note that these same
 916    option letters are used in the flags parameters of regex functions.
 917
 918    Table 9.24. ARE Embedded-Option Letters
 919    Option Description
 920    b rest of RE is a BRE
 921    c case-sensitive matching (overrides operator type)
 922    e rest of RE is an ERE
 923    i case-insensitive matching (see Section 9.7.3.5) (overrides operator
 924    type)
 925    m historical synonym for n
 926    n newline-sensitive matching (see Section 9.7.3.5)
 927    p partial newline-sensitive matching (see Section 9.7.3.5)
 928    q rest of RE is a literal (“quoted”) string, all ordinary characters
 929    s non-newline-sensitive matching (default)
 930    t tight syntax (default; see below)
 931    w inverse partial newline-sensitive (“weird”) matching (see
 932    Section 9.7.3.5)
 933    x expanded syntax (see below)
 934
 935    Embedded options take effect at the ) terminating the sequence. They
 936    can appear only at the start of an ARE (after the ***: director if
 937    any).
 938
 939    In addition to the usual (tight) RE syntax, in which all characters are
 940    significant, there is an expanded syntax, available by specifying the
 941    embedded x option. In the expanded syntax, white-space characters in
 942    the RE are ignored, as are all characters between a # and the following
 943    newline (or the end of the RE). This permits paragraphing and
 944    commenting a complex RE. There are three exceptions to that basic rule:
 945      * a white-space character or # preceded by \ is retained
 946      * white space or # within a bracket expression is retained
 947      * white space and comments cannot appear within multi-character
 948        symbols, such as (?:
 949
 950    For this purpose, white-space characters are blank, tab, newline, and
 951    any character that belongs to the space character class.
 952
 953    Finally, in an ARE, outside bracket expressions, the sequence (?#ttt)
 954    (where ttt is any text not containing a )) is a comment, completely
 955    ignored. Again, this is not allowed between the characters of
 956    multi-character symbols, like (?:. Such comments are more a historical
 957    artifact than a useful facility, and their use is deprecated; use the
 958    expanded syntax instead.
 959
 960    None of these metasyntax extensions is available if an initial ***=
 961    director has specified that the user's input be treated as a literal
 962    string rather than as an RE.
 963
 964 9.7.3.5. Regular Expression Matching Rules #
 965
 966    In the event that an RE could match more than one substring of a given
 967    string, the RE matches the one starting earliest in the string. If the
 968    RE could match more than one substring starting at that point, either
 969    the longest possible match or the shortest possible match will be
 970    taken, depending on whether the RE is greedy or non-greedy.
 971
 972    Whether an RE is greedy or not is determined by the following rules:
 973      * Most atoms, and all constraints, have no greediness attribute
 974        (because they cannot match variable amounts of text anyway).
 975      * Adding parentheses around an RE does not change its greediness.
 976      * A quantified atom with a fixed-repetition quantifier ({m} or {m}?)
 977        has the same greediness (possibly none) as the atom itself.
 978      * A quantified atom with other normal quantifiers (including {m,n}
 979        with m equal to n) is greedy (prefers longest match).
 980      * A quantified atom with a non-greedy quantifier (including {m,n}?
 981        with m equal to n) is non-greedy (prefers shortest match).
 982      * A branch — that is, an RE that has no top-level | operator — has
 983        the same greediness as the first quantified atom in it that has a
 984        greediness attribute.
 985      * An RE consisting of two or more branches connected by the |
 986        operator is always greedy.
 987
 988    The above rules associate greediness attributes not only with
 989    individual quantified atoms, but with branches and entire REs that
 990    contain quantified atoms. What that means is that the matching is done
 991    in such a way that the branch, or whole RE, matches the longest or
 992    shortest possible substring as a whole. Once the length of the entire
 993    match is determined, the part of it that matches any particular
 994    subexpression is determined on the basis of the greediness attribute of
 995    that subexpression, with subexpressions starting earlier in the RE
 996    taking priority over ones starting later.
 997
 998    An example of what this means:
 999 SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})');
1000 Result: 123
1001 SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
1002 Result: 1
1003
1004    In the first case, the RE as a whole is greedy because Y* is greedy. It
1005    can match beginning at the Y, and it matches the longest possible
1006    string starting there, i.e., Y123. The output is the parenthesized part
1007    of that, or 123. In the second case, the RE as a whole is non-greedy
1008    because Y*? is non-greedy. It can match beginning at the Y, and it
1009    matches the shortest possible string starting there, i.e., Y1. The
1010    subexpression [0-9]{1,3} is greedy but it cannot change the decision as
1011    to the overall match length; so it is forced to match just 1.
1012
1013    In short, when an RE contains both greedy and non-greedy
1014    subexpressions, the total match length is either as long as possible or
1015    as short as possible, according to the attribute assigned to the whole
1016    RE. The attributes assigned to the subexpressions only affect how much
1017    of that match they are allowed to “eat” relative to each other.
1018
1019    The quantifiers {1,1} and {1,1}? can be used to force greediness or
1020    non-greediness, respectively, on a subexpression or a whole RE. This is
1021    useful when you need the whole RE to have a greediness attribute
1022    different from what's deduced from its elements. As an example, suppose
1023    that we are trying to separate a string containing some digits into the
1024    digits and the parts before and after them. We might try to do that
1025    like this:
1026 SELECT regexp_match('abc01234xyz', '(.*)(\d+)(.*)');
1027 Result: {abc0123,4,xyz}
1028
1029    That didn't work: the first .* is greedy so it “eats” as much as it
1030    can, leaving the \d+ to match at the last possible place, the last
1031    digit. We might try to fix that by making it non-greedy:
1032 SELECT regexp_match('abc01234xyz', '(.*?)(\d+)(.*)');
1033 Result: {abc,0,""}
1034
1035    That didn't work either, because now the RE as a whole is non-greedy
1036    and so it ends the overall match as soon as possible. We can get what
1037    we want by forcing the RE as a whole to be greedy:
1038 SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
1039 Result: {abc,01234,xyz}
1040
1041    Controlling the RE's overall greediness separately from its components'
1042    greediness allows great flexibility in handling variable-length
1043    patterns.
1044
1045    When deciding what is a longer or shorter match, match lengths are
1046    measured in characters, not collating elements. An empty string is
1047    considered longer than no match at all. For example: bb* matches the
1048    three middle characters of abbbc; (week|wee)(night|knights) matches all
1049    ten characters of weeknights; when (.*).* is matched against abc the
1050    parenthesized subexpression matches all three characters; and when
1051    (a*)* is matched against bc both the whole RE and the parenthesized
1052    subexpression match an empty string.
1053
1054    If case-independent matching is specified, the effect is much as if all
1055    case distinctions had vanished from the alphabet. When an alphabetic
1056    that exists in multiple cases appears as an ordinary character outside
1057    a bracket expression, it is effectively transformed into a bracket
1058    expression containing both cases, e.g., x becomes [xX]. When it appears
1059    inside a bracket expression, all case counterparts of it are added to
1060    the bracket expression, e.g., [x] becomes [xX] and [^x] becomes [^xX].
1061
1062    If newline-sensitive matching is specified, . and bracket expressions
1063    using ^ will never match the newline character (so that matches will
1064    not cross lines unless the RE explicitly includes a newline) and ^ and
1065    $ will match the empty string after and before a newline respectively,
1066    in addition to matching at beginning and end of string respectively.
1067    But the ARE escapes \A and \Z continue to match beginning or end of
1068    string only. Also, the character class shorthands \D and \W will match
1069    a newline regardless of this mode. (Before PostgreSQL 14, they did not
1070    match newlines when in newline-sensitive mode. Write [^[:digit:]] or
1071    [^[:word:]] to get the old behavior.)
1072
1073    If partial newline-sensitive matching is specified, this affects . and
1074    bracket expressions as with newline-sensitive matching, but not ^ and
1075    $.
1076
1077    If inverse partial newline-sensitive matching is specified, this
1078    affects ^ and $ as with newline-sensitive matching, but not . and
1079    bracket expressions. This isn't very useful but is provided for
1080    symmetry.
1081
1082 9.7.3.6. Limits and Compatibility #
1083
1084    No particular limit is imposed on the length of REs in this
1085    implementation. However, programs intended to be highly portable should
1086    not employ REs longer than 256 bytes, as a POSIX-compliant
1087    implementation can refuse to accept such REs.
1088
1089    The only feature of AREs that is actually incompatible with POSIX EREs
1090    is that \ does not lose its special significance inside bracket
1091    expressions. All other ARE features use syntax which is illegal or has
1092    undefined or unspecified effects in POSIX EREs; the *** syntax of
1093    directors likewise is outside the POSIX syntax for both BREs and EREs.
1094
1095    Many of the ARE extensions are borrowed from Perl, but some have been
1096    changed to clean them up, and a few Perl extensions are not present.
1097    Incompatibilities of note include \b, \B, the lack of special treatment
1098    for a trailing newline, the addition of complemented bracket
1099    expressions to the things affected by newline-sensitive matching, the
1100    restrictions on parentheses and back references in lookahead/lookbehind
1101    constraints, and the longest/shortest-match (rather than first-match)
1102    matching semantics.
1103
1104 9.7.3.7. Basic Regular Expressions #
1105
1106    BREs differ from EREs in several respects. In BREs, |, +, and ? are
1107    ordinary characters and there is no equivalent for their functionality.
1108    The delimiters for bounds are \{ and \}, with { and } by themselves
1109    ordinary characters. The parentheses for nested subexpressions are \(
1110    and \), with ( and ) by themselves ordinary characters. ^ is an
1111    ordinary character except at the beginning of the RE or the beginning
1112    of a parenthesized subexpression, $ is an ordinary character except at
1113    the end of the RE or the end of a parenthesized subexpression, and * is
1114    an ordinary character if it appears at the beginning of the RE or the
1115    beginning of a parenthesized subexpression (after a possible leading
1116    ^). Finally, single-digit back references are available, and \< and \>
1117    are synonyms for [[:<:]] and [[:>:]] respectively; no other escapes are
1118    available in BREs.
1119
1120 9.7.3.8. Differences from SQL Standard and XQuery #
1121
1122    Since SQL:2008, the SQL standard includes regular expression operators
1123    and functions that performs pattern matching according to the XQuery
1124    regular expression standard:
1125      * LIKE_REGEX
1126      * OCCURRENCES_REGEX
1127      * POSITION_REGEX
1128      * SUBSTRING_REGEX
1129      * TRANSLATE_REGEX
1130
1131    PostgreSQL does not currently implement these operators and functions.
1132    You can get approximately equivalent functionality in each case as
1133    shown in Table 9.25. (Various optional clauses on both sides have been
1134    omitted in this table.)
1135
1136    Table 9.25. Regular Expression Functions Equivalencies
1137    SQL standard PostgreSQL
1138    string LIKE_REGEX pattern regexp_like(string, pattern) or string ~
1139    pattern
1140    OCCURRENCES_REGEX(pattern IN string) regexp_count(string, pattern)
1141    POSITION_REGEX(pattern IN string) regexp_instr(string, pattern)
1142    SUBSTRING_REGEX(pattern IN string) regexp_substr(string, pattern)
1143    TRANSLATE_REGEX(pattern IN string WITH replacement)
1144    regexp_replace(string, pattern, replacement)
1145
1146    Regular expression functions similar to those provided by PostgreSQL
1147    are also available in a number of other SQL implementations, whereas
1148    the SQL-standard functions are not as widely implemented. Some of the
1149    details of the regular expression syntax will likely differ in each
1150    implementation.
1151
1152    The SQL-standard operators and functions use XQuery regular
1153    expressions, which are quite close to the ARE syntax described above.
1154    Notable differences between the existing POSIX-based regular-expression
1155    feature and XQuery regular expressions include:
1156      * XQuery character class subtraction is not supported. An example of
1157        this feature is using the following to match only English
1158        consonants: [a-z-[aeiou]].
1159      * XQuery character class shorthands \c, \C, \i, and \I are not
1160        supported.
1161      * XQuery character class elements using \p{UnicodeProperty} or the
1162        inverse \P{UnicodeProperty} are not supported.
1163      * POSIX interprets character classes such as \w (see Table 9.21)
1164        according to the prevailing locale (which you can control by
1165        attaching a COLLATE clause to the operator or function). XQuery
1166        specifies these classes by reference to Unicode character
1167        properties, so equivalent behavior is obtained only with a locale
1168        that follows the Unicode rules.
1169      * The SQL standard (not XQuery itself) attempts to cater for more
1170        variants of “newline” than POSIX does. The newline-sensitive
1171        matching options described above consider only ASCII NL (\n) to be
1172        a newline, but SQL would have us treat CR (\r), CRLF (\r\n) (a
1173        Windows-style newline), and some Unicode-only characters like LINE
1174        SEPARATOR (U+2028) as newlines as well. Notably, . and \s should
1175        count \r\n as one character not two according to SQL.
1176      * Of the character-entry escapes described in Table 9.20, XQuery
1177        supports only \n, \r, and \t.
1178      * XQuery does not support the [:name:] syntax for character classes
1179        within bracket expressions.
1180      * XQuery does not have lookahead or lookbehind constraints, nor any
1181        of the constraint escapes described in Table 9.22.
1182      * The metasyntax forms described in Section 9.7.3.4 do not exist in
1183        XQuery.
1184      * The regular expression flag letters defined by XQuery are related
1185        to but not the same as the option letters for POSIX (Table 9.24).
1186        While the i and q options behave the same, others do not:
1187           + XQuery's s (allow dot to match newline) and m (allow ^ and $
1188             to match at newlines) flags provide access to the same
1189             behaviors as POSIX's n, p and w flags, but they do not match
1190             the behavior of POSIX's s and m flags. Note in particular that
1191             dot-matches-newline is the default behavior in POSIX but not
1192             XQuery.
1193           + XQuery's x (ignore whitespace in pattern) flag is noticeably
1194             different from POSIX's expanded-mode flag. POSIX's x flag also
1195             allows # to begin a comment in the pattern, and POSIX will not
1196             ignore a whitespace character after a backslash.