12.5. Parsers

   Text search parsers are responsible for splitting raw document text
   into tokens and identifying each token's type, where the set of
   possible types is defined by the parser itself. Note that a parser does
   not modify the text at all; it simply identifies plausible word
   boundaries. Because of this limited scope, there is less need for
   application-specific custom parsers than there is for custom
   dictionaries. At present PostgreSQL provides just one built-in parser,
   which has been found to be useful for a wide range of applications.

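   As a minimal sketch of the splitting step described above, the
   ts_parse function can be called directly on a string. The sample text
   is an arbitrary illustration, and the join against ts_token_type is
   just one way to turn the numeric type ids into readable aliases.

```sql
-- Split a sample string with the built-in parser.
-- ts_parse returns one row per token: a numeric type id and the token text.
SELECT tokid, token
FROM ts_parse('default', 'elephant 1234 foo@example.com');

-- The same tokens with their type aliases, kept in input order.
-- WITH ORDINALITY adds a row counter so the join does not lose the order.
SELECT p.ord, t.alias, p.token
FROM ts_parse('default', 'elephant 1234 foo@example.com')
         WITH ORDINALITY AS p(tokid, token, ord)
JOIN ts_token_type('default') AS t USING (tokid)
ORDER BY p.ord;
```

   Each token value should be a verbatim piece of the input, which is
   consistent with the statement that the parser does not modify the
   text.
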
   The built-in parser is named pg_catalog.default. It recognizes 23 token
   types, shown in Table 12.1.

   Table 12.1. Default Parser's Token Types

   Alias           | Description                              | Example
   ----------------+------------------------------------------+------------------------------
   asciiword       | Word, all ASCII letters                  | elephant
   word            | Word, all letters                        | mañana
   numword         | Word, letters and digits                 | beta1
   asciihword      | Hyphenated word, all ASCII               | up-to-date
   hword           | Hyphenated word, all letters             | lógico-matemática
   numhword        | Hyphenated word, letters and digits      | postgresql-beta1
   hword_asciipart | Hyphenated word part, all ASCII          | postgresql in the context postgresql-beta1
   hword_part      | Hyphenated word part, all letters        | lógico or matemática in the context lógico-matemática
   hword_numpart   | Hyphenated word part, letters and digits | beta1 in the context postgresql-beta1
   email           | Email address                            | foo@example.com
   protocol        | Protocol head                            | http://
   url             | URL                                      | example.com/stuff/index.html
   host            | Host                                     | example.com
   url_path        | URL path                                 | /stuff/index.html, in the context of a URL
   file            | File or path name                        | /usr/local/foo.txt, if not within a URL
   sfloat          | Scientific notation                      | -1.234e56
   float           | Decimal notation                         | -1.234
   int             | Signed integer                           | -1234
   uint            | Unsigned integer                         | 1234
   version         | Version number                           | 8.3.0
   tag             | XML tag                                  | <a href="dictionaries.html">
   entity          | XML entity                               | &amp;
   blank           | Space symbols                            | (any whitespace or punctuation not otherwise recognized)

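   The token types in Table 12.1 can also be read from a running server
   rather than from the documentation. The following is a small sketch
   using ts_token_type; the column alias token_types is only an
   illustrative name.

```sql
-- List every token type the built-in parser can report.
SELECT tokid, alias, description
FROM ts_token_type('default')
ORDER BY tokid;

-- Count them; this should agree with the number of rows in Table 12.1.
SELECT count(*) AS token_types
FROM ts_token_type('default');
```
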
Note

   The parser's notion of a “letter” is determined by the database's
   locale setting, specifically lc_ctype. Words containing only the basic
   ASCII letters are reported as a separate token type, since it is
   sometimes useful to distinguish them. In most European languages, token
   types word and asciiword should be treated alike.

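   To see how the locale affects classification on a particular
   installation, one possible check is sketched below. It assumes the
   character-type locale is read from the datctype column of pg_database,
   and it filters out blank tokens only to keep the result short. The
   aliases reported for the second word depend on the server's locale and
   encoding, so no expected output is shown.

```sql
-- Which character-type locale does the current database use?
SELECT datctype
FROM pg_database
WHERE datname = current_database();

-- Compare an ASCII-only word with a word containing a non-ASCII letter.
SELECT alias, token
FROM ts_debug('elephant mañana')
WHERE alias <> 'blank';
```
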
   email does not support all valid email characters as defined by RFC
   5322. Specifically, the only non-alphanumeric characters supported for
   email user names are period, dash, and underscore.

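   A quick way to probe this limitation is to compare two addresses, one
   whose user name uses only the supported punctuation and one that
   contains a plus sign, which RFC 5322 permits. Both addresses are made
   up for illustration; the output is not shown because how the second
   address is split is best confirmed on the server in use.

```sql
-- User name restricted to period, dash, and underscore.
SELECT alias, token
FROM ts_debug('foo.bar_baz-1@example.com');

-- User name containing a plus sign, which is outside the supported set.
SELECT alias, token
FROM ts_debug('foo+tag@example.com');
```
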
   tag does not support all valid tag names as defined by the W3C
   Recommendation for XML. Specifically, the only tag names supported are
   those starting with an ASCII letter, underscore, or colon, and
   containing only letters, digits, hyphens, underscores, periods, and
   colons. tag also includes XML comments starting with <!-- and ending
   with -->, and XML declarations (but note that this includes anything
   starting with <?x and ending with >).

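   The tag rules can be explored in the same way. The input below is an
   invented fragment that mixes an ordinary element, a comment, and an
   XML declaration; blank tokens are filtered out for brevity.

```sql
-- Inspect how markup is classified: element tags, a comment, a declaration.
SELECT alias, token
FROM ts_debug('<a href="dictionaries.html">link</a> <!-- note --> <?xml version="1.0"?>')
WHERE alias <> 'blank';
```
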
   It is possible for the parser to produce overlapping tokens from the
   same piece of text. As an example, a hyphenated word will be reported
   both as the entire word and as each component:
SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
      alias      |               description                |     token
-----------------+------------------------------------------+---------------
 numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_asciipart | Hyphenated word part, all ASCII          | bar
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | beta1

   This behavior is desirable since it allows searches to work for both
   the whole compound word and for components. Here is another instructive
   example:
SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
  alias   |  description  |            token
----------+---------------+------------------------------
 protocol | Protocol head | http://
 url      | URL           | example.com/stuff/index.html
 host     | Host          | example.com
 url_path | URL path      | /stuff/index.html