Text search parsers are responsible for splitting raw document text
into tokens and identifying each token's type, where the set of
possible types is defined by the parser itself. Note that a parser does
not modify the text at all; it simply identifies plausible word
boundaries. Because of this limited scope, there is less need for
application-specific custom parsers than there is for custom
dictionaries. At present PostgreSQL provides just one built-in parser,
which has been found to be useful for a wide range of applications.

The built-in parser is named pg_catalog.default. It recognizes 23 token
types, shown in Table 12.1.

Table 12.1. Default Parser's Token Types

 Alias           | Description                              | Example
-----------------+------------------------------------------+----------------------------------------------------------
 asciiword       | Word, all ASCII letters                  | elephant
 word            | Word, all letters                        | mañana
 numword         | Word, letters and digits                 | beta1
 asciihword      | Hyphenated word, all ASCII               | up-to-date
 hword           | Hyphenated word, all letters             | lógico-matemática
 numhword        | Hyphenated word, letters and digits      | postgresql-beta1
 hword_asciipart | Hyphenated word part, all ASCII          | postgresql in the context postgresql-beta1
 hword_part      | Hyphenated word part, all letters        | lógico or matemática in the context lógico-matemática
 hword_numpart   | Hyphenated word part, letters and digits | beta1 in the context postgresql-beta1
 email           | Email address                            | foo@example.com
 protocol        | Protocol head                            | http://
 url             | URL                                      | example.com/stuff/index.html
 host            | Host                                     | example.com
 url_path        | URL path                                 | /stuff/index.html, in the context of a URL
 file            | File or path name                        | /usr/local/foo.txt, if not within a URL
 sfloat          | Scientific notation                      | -1.234e56
 float           | Decimal notation                         | -1.234
 int             | Signed integer                           | -1234
 uint            | Unsigned integer                         | 1234
 version         | Version number                           | 8.3.0
 tag             | XML tag                                  | <a href="dictionaries.html">
 entity          | XML entity                               | &amp;
 blank           | Space symbols                            | (any whitespace or punctuation not otherwise recognized)
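
The list of token types can also be read from a running server instead of from the table. The following is a minimal sketch, assuming a session connected to a PostgreSQL database that has the built-in default parser; the ORDER BY clause is added here only to give a stable listing.

```sql
-- List the token types that the built-in parser can emit.
SELECT tokid, alias, description
FROM ts_token_type('default')
ORDER BY tokid;
```

The alias column holds the names used to refer to token types when configuring how each type is handled.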
The parser's notion of a “letter” is determined by the database's
locale setting, specifically lc_ctype. Words containing only the basic
ASCII letters are reported as a separate token type, since it is
sometimes useful to distinguish them. In most European languages, token
types word and asciiword should be treated alike.
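
One way to treat word and asciiword alike is to map both token types to the same dictionary. The sketch below is illustrative only: my_config is a hypothetical configuration name, and the choice of english_stem as the dictionary is an assumption made for the example, not a recommendation from the text.

```sql
-- Check which character classification locale the current database uses.
SHOW lc_ctype;

-- Hypothetical configuration, copied from the built-in English one.
CREATE TEXT SEARCH CONFIGURATION my_config (COPY = pg_catalog.english);

-- Map both token types to the same dictionary so they are treated alike.
ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR asciiword, word WITH english_stem;

-- Inspect how an ASCII-only word and a non-ASCII word are classified.
SELECT alias, token, dictionaries
FROM ts_debug('my_config', 'elephant mañana');
```

How the second word is classified depends on the database's encoding and locale, so the result is best checked on the target system.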
email does not support all valid email characters as defined by RFC
5322. Specifically, the only non-alphanumeric characters supported for
email user names are period, dash, and underscore.
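
The effect of this restriction can be observed with ts_debug. This is a rough sketch; both addresses are made up for illustration, and the built-in simple configuration is used only because it is always available.

```sql
-- The first user name uses only period and underscore, which the parser
-- supports. The second contains a plus sign, which RFC 5322 allows but
-- which is outside the supported set, so it is not expected to be
-- reported as a single email token.
SELECT alias, token
FROM ts_debug('simple', 'first.last_name@example.com user+tag@example.com');
```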
tag does not support all valid tag names as defined by the W3C XML
Recommendation. Specifically, the only tag names supported are
those starting with an ASCII letter, underscore, or colon, and
containing only letters, digits, hyphens, underscores, periods, and
colons. tag also includes XML comments starting with <!-- and ending
with -->, and XML declarations (but note that this includes anything
starting with <?x and ending with >).
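
A quick way to see which pieces of markup are reported as tag is to run a small sample through ts_debug. This is a sketch with an invented input string; it shows an ordinary element, a comment, and an XML declaration side by side.

```sql
-- Show the token type assigned to each piece of the sample markup.
SELECT alias, token
FROM ts_debug('simple',
              '<a href="dictionaries.html">text</a> <!-- a comment --> <?xml version="1.0"?>');
```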
It is possible for the parser to produce overlapping tokens from the
same piece of text. As an example, a hyphenated word will be reported
both as the entire word and as each component:

SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
      alias      |               description                |     token
-----------------+------------------------------------------+---------------
 numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_asciipart | Hyphenated word part, all ASCII          | bar
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | beta1

This behavior is desirable since it allows searches to work for both
the whole compound word and for components. Here is another instructive
example:

SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.h
  alias   |  description  |            token
----------+---------------+------------------------------
 protocol | Protocol head | http://
 url      | URL           | example.com/stuff/index.html
 host     | Host          | example.com
 url_path | URL path      | /stuff/index.html
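
The earlier point about hyphenated words, that searches can work for both the compound word and its components, can be checked directly. The following is a minimal sketch, assuming the built-in simple configuration; the column names component_match and compound_match are chosen here for readability.

```sql
-- Lexemes and positions produced from the overlapping tokens.
SELECT to_tsvector('simple', 'foo-bar-beta1');

-- Test a query for one component and a query for the whole compound word.
SELECT to_tsvector('simple', 'foo-bar-beta1') @@ to_tsquery('simple', 'bar')
           AS component_match,
       to_tsvector('simple', 'foo-bar-beta1') @@ to_tsquery('simple', 'foo-bar-beta1')
           AS compound_match;
```

Because the query text goes through the same parser as the document, the compound query is itself split into the whole word and its parts before matching.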