2 23.2. Collation Support #
5 23.2.2. Managing Collations
6 23.2.3. ICU Custom Collations
8 The collation feature allows specifying the sort order and character
9 classification behavior of data per-column, or even per-operation. This
10 alleviates the restriction that the LC_COLLATE and LC_CTYPE settings of
11 a database cannot be changed after its creation.
15 Conceptually, every expression of a collatable data type has a
16 collation. (The built-in collatable data types are text, varchar, and
17 char. User-defined base types can also be marked collatable, and of
18 course a domain over a collatable data type is collatable.) If the
19 expression is a column reference, the collation of the expression is
20 the defined collation of the column. If the expression is a constant,
21 the collation is the default collation of the data type of the
22 constant. The collation of a more complex expression is derived from
23 the collations of its inputs, as described below.
25 The collation of an expression can be the “default” collation, which
26 means the locale settings defined for the database. It is also possible
27 for an expression's collation to be indeterminate. In such cases,
28 ordering operations and other operations that need to know the
31 When the database system has to perform an ordering or a character
32 classification, it uses the collation of the input expression. This
33 happens, for example, with ORDER BY clauses and function or operator
34 calls such as <. The collation to apply for an ORDER BY clause is
35 simply the collation of the sort key. The collation to apply for a
36 function or operator call is derived from the arguments, as described
37 below. In addition to comparison operators, collations are taken into
38 account by functions that convert between lower and upper case letters,
39 such as lower, upper, and initcap; by pattern matching operators; and
40 by to_char and related functions.
42 For a function or operator call, the collation that is derived by
43 examining the argument collations is used at run time for performing
44 the specified operation. If the result of the function or operator call
45 is of a collatable data type, the collation is also used at parse time
46 as the defined collation of the function or operator expression, in
47 case there is a surrounding expression that requires knowledge of its
50 The collation derivation of an expression can be implicit or explicit.
51 This distinction affects how collations are combined when multiple
52 different collations appear in an expression. An explicit collation
53 derivation occurs when a COLLATE clause is used; all other collation
54 derivations are implicit. When multiple collations need to be combined,
55 for example in a function call, the following rules are used:
56 1. If any input expression has an explicit collation derivation, then
57 all explicitly derived collations among the input expressions must
58 be the same, otherwise an error is raised. If any explicitly
59 derived collation is present, that is the result of the collation
61 2. Otherwise, all input expressions must have the same implicit
62 collation derivation or the default collation. If any non-default
63 collation is present, that is the result of the collation
64 combination. Otherwise, the result is the default collation.
65 3. If there are conflicting non-default implicit collations among the
66 input expressions, then the combination is deemed to have
67 indeterminate collation. This is not an error condition unless the
68 particular function being invoked requires knowledge of the
69 collation it should apply. If it does, an error will be raised at
72 For example, consider this table definition:
74 a text COLLATE "de_DE",
75 b text COLLATE "es_ES",
80 SELECT a < 'foo' FROM test1;
82 the < comparison is performed according to de_DE rules, because the
83 expression combines an implicitly derived collation with the default
85 SELECT a < ('foo' COLLATE "fr_FR") FROM test1;
87 the comparison is performed using fr_FR rules, because the explicit
88 collation derivation overrides the implicit one. Furthermore, given
89 SELECT a < b FROM test1;
91 the parser cannot determine which collation to apply, since the a and b
92 columns have conflicting implicit collations. Since the < operator does
93 need to know which collation to use, this will result in an error. The
94 error can be resolved by attaching an explicit collation specifier to
95 either input expression, thus:
96 SELECT a < b COLLATE "de_DE" FROM test1;
99 SELECT a COLLATE "de_DE" < b FROM test1;
101 On the other hand, the structurally similar case
102 SELECT a || b FROM test1;
104 does not result in an error, because the || operator does not care
105 about collations: its result is the same regardless of the collation.
107 The collation assigned to a function or operator's combined input
108 expressions is also considered to apply to the function or operator's
109 result, if the function or operator delivers a result of a collatable
111 SELECT * FROM test1 ORDER BY a || 'foo';
113 the ordering will be done according to de_DE rules. But this query:
114 SELECT * FROM test1 ORDER BY a || b;
116 results in an error, because even though the || operator doesn't need
117 to know a collation, the ORDER BY clause does. As before, the conflict
118 can be resolved with an explicit collation specifier:
119 SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
121 23.2.2. Managing Collations #
123 A collation is an SQL schema object that maps an SQL name to locales
124 provided by libraries installed in the operating system. A collation
125 definition has a provider that specifies which library supplies the
126 locale data. One standard provider name is libc, which uses the locales
127 provided by the operating system C library. These are the locales used
128 by most tools provided by the operating system. Another provider is
129 icu, which uses the external ICU library. ICU locales can only be used
130 if support for ICU was configured when PostgreSQL was built.
132 A collation object provided by libc maps to a combination of LC_COLLATE
133 and LC_CTYPE settings, as accepted by the setlocale() system library
134 call. (As the name would suggest, the main purpose of a collation is to
135 set LC_COLLATE, which controls the sort order. But it is rarely
136 necessary in practice to have an LC_CTYPE setting that is different
137 from LC_COLLATE, so it is more convenient to collect these under one
138 concept than to create another infrastructure for setting LC_CTYPE per
139 expression.) Also, a libc collation is tied to a character set encoding
140 (see Section 23.3). The same collation name may exist for different
143 A collation object provided by icu maps to a named collator provided by
144 the ICU library. ICU does not support separate “collate” and “ctype”
145 settings, so they are always the same. Also, ICU collations are
146 independent of the encoding, so there is always only one ICU collation
147 of a given name in a database.
149 23.2.2.1. Standard Collations #
151 On all platforms, the following collations are supported:
154 This SQL standard collation sorts using the Unicode Collation
155 Algorithm with the Default Unicode Collation Element Table. It
156 is available in all encodings. ICU support is required to use
157 this collation, and behavior may change if PostgreSQL is built
158 with a different version of ICU. (This collation has the same
159 behavior as the ICU root locale; see und-x-icu (for
163 This SQL standard collation sorts using the Unicode code point
164 values rather than natural language order, and only the ASCII
165 letters “A” through “Z” are treated as letters. The behavior is
166 efficient and stable across all versions. Only available for
167 encoding UTF8. (This collation has the same behavior as the libc
168 locale specification C in UTF8 encoding.)
171 This collation sorts by Unicode code point values rather than
172 natural language order. For the functions lower, initcap, and
173 upper it uses Unicode full case mapping. For pattern matching
174 (including regular expressions), it uses the Standard variant of
175 Unicode Compatibility Properties. Behavior is efficient and
176 stable within a Postgres major version. It is only available for
180 This collation sorts by Unicode code point values rather than
181 natural language order. For the functions lower, initcap, and
182 upper, it uses Unicode simple case mapping. For pattern matching
183 (including regular expressions), it uses the POSIX Compatible
184 variant of Unicode Compatibility Properties. Behavior is
185 efficient and stable within a PostgreSQL major version. This
186 collation is only available for encoding UTF8.
188 C (equivalent to POSIX)
189 The C and POSIX collations are based on “traditional C”
190 behavior. They sort by byte values rather than natural language
191 order, and only the ASCII letters “A” through “Z” are treated as
192 letters. The behavior is efficient and stable across all
193 versions for a given database encoding, but behavior may vary
194 between different database encodings.
197 The default collation selects the locale specified at database
200 Additional collations may be available depending on operating system
201 support. The efficiency and stability of these additional collations
202 depend on the collation provider, the provider version, and the locale.
204 23.2.2.2. Predefined Collations #
206 If the operating system provides support for using multiple locales
207 within a single program (newlocale and related functions), or if
208 support for ICU is configured, then when a database cluster is
209 initialized, initdb populates the system catalog pg_collation with
210 collations based on all the locales it finds in the operating system at
213 To inspect the currently available locales, use the query SELECT * FROM
214 pg_collation, or the command \dOS+ in psql.
216 23.2.2.2.1. libc Collations #
218 For example, the operating system might provide a locale named
219 de_DE.utf8. initdb would then create a collation named de_DE.utf8 for
220 encoding UTF8 that has both LC_COLLATE and LC_CTYPE set to de_DE.utf8.
221 It will also create a collation with the .utf8 tag stripped off the
222 name. So you could also use the collation under the name de_DE, which
223 is less cumbersome to write and makes the name less encoding-dependent.
224 Note that, nevertheless, the initial set of collation names is
227 The default set of collations provided by libc map directly to the
228 locales installed in the operating system, which can be listed using
229 the command locale -a. In case a libc collation is needed that has
230 different values for LC_COLLATE and LC_CTYPE, or if new locales are
231 installed in the operating system after the database system was
232 initialized, then a new collation may be created using the CREATE
233 COLLATION command. New operating system locales can also be imported en
234 masse using the pg_import_system_collations() function.
236 Within any particular database, only collations that use that
237 database's encoding are of interest. Other entries in pg_collation are
238 ignored. Thus, a stripped collation name such as de_DE can be
239 considered unique within a given database even though it would not be
240 unique globally. Use of the stripped collation names is recommended,
241 since it will make one fewer thing you need to change if you decide to
242 change to another database encoding. Note however that the default, C,
243 and POSIX collations can be used regardless of the database encoding.
245 PostgreSQL considers distinct collation objects to be incompatible even
246 when they have identical properties. Thus for example,
247 SELECT a COLLATE "C" < b COLLATE "POSIX" FROM test1;
249 will draw an error even though the C and POSIX collations have
250 identical behaviors. Mixing stripped and non-stripped collation names
251 is therefore not recommended.
253 23.2.2.2.2. ICU Collations #
255 With ICU, it is not sensible to enumerate all possible locale names.
256 ICU uses a particular naming system for locales, but there are many
257 more ways to name a locale than there are actually distinct locales.
258 initdb uses the ICU APIs to extract a set of distinct locales to
259 populate the initial set of collations. Collations provided by ICU are
260 created in the SQL environment with names in BCP 47 language tag
261 format, with a “private use” extension -x-icu appended, to distinguish
262 them from libc locales.
264 Here are some example collations that might be created:
267 German collation, default variant
270 German collation for Austria, default variant
272 (There are also, say, de-DE-x-icu or de-CH-x-icu, but as of this
273 writing, they are equivalent to de-x-icu.)
275 und-x-icu (for “undefined”) #
276 ICU “root” collation. Use this to get a reasonable
277 language-agnostic sort order.
279 Some (less frequently used) encodings are not supported by ICU. When
280 the database encoding is one of these, ICU collation entries in
281 pg_collation are ignored. Attempting to use one will draw an error
282 along the lines of “collation "de-x-icu" for encoding "WIN874" does not
285 23.2.2.3. Creating New Collation Objects #
287 If the standard and predefined collations are not sufficient, users can
288 create their own collation objects using the SQL command CREATE
291 The standard and predefined collations are in the schema pg_catalog,
292 like all predefined objects. User-defined collations should be created
293 in user schemas. This also ensures that they are saved by pg_dump.
295 23.2.2.3.1. libc Collations #
297 New libc collations can be created like this:
298 CREATE COLLATION german (provider = libc, locale = 'de_DE');
300 The exact values that are acceptable for the locale clause in this
301 command depend on the operating system. On Unix-like systems, the
302 command locale -a will show a list.
304 Since the predefined libc collations already include all collations
305 defined in the operating system when the database instance is
306 initialized, it is not often necessary to manually create new ones.
307 Reasons might be if a different naming system is desired (in which case
308 see also Section 23.2.2.3.3) or if the operating system has been
309 upgraded to provide new locale definitions (in which case see also
310 pg_import_system_collations()).
312 23.2.2.3.2. ICU Collations #
314 ICU collations can be created like:
315 CREATE COLLATION german (provider = icu, locale = 'de-DE');
317 ICU locales are specified as a BCP 47 Language Tag, but can also accept
318 most libc-style locale names. If possible, libc-style locale names are
319 transformed into language tags.
321 New ICU collations can customize collation behavior extensively by
322 including collation attributes in the language tag. See Section 23.2.3
323 for details and examples.
325 23.2.2.3.3. Copying Collations #
327 The command CREATE COLLATION can also be used to create a new collation
328 from an existing collation, which can be useful to be able to use
329 operating-system-independent collation names in applications, create
330 compatibility names, or use an ICU-provided collation under a more
331 readable name. For example:
332 CREATE COLLATION german FROM "de_DE";
333 CREATE COLLATION french FROM "fr-x-icu";
335 23.2.2.4. Nondeterministic Collations #
337 A collation is either deterministic or nondeterministic. A
338 deterministic collation uses deterministic comparisons, which means
339 that it considers strings to be equal only if they consist of the same
340 byte sequence. Nondeterministic comparison may determine strings to be
341 equal even if they consist of different bytes. Typical situations
342 include case-insensitive comparison, accent-insensitive comparison, as
343 well as comparison of strings in different Unicode normal forms. It is
344 up to the collation provider to actually implement such insensitive
345 comparisons; the deterministic flag only determines whether ties are to
346 be broken using bytewise comparison. See also Unicode Technical
347 Standard 10 for more information on the terminology.
349 To create a nondeterministic collation, specify the property
350 deterministic = false to CREATE COLLATION, for example:
351 CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false);
353 This example would use the standard Unicode collation in a
354 nondeterministic way. In particular, this would allow strings in
355 different normal forms to be compared correctly. More interesting
356 examples make use of the ICU customization facilities explained above.
358 CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', d
359 eterministic = false);
360 CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
361 ue', deterministic = false);
363 All standard and predefined collations are deterministic, all
364 user-defined collations are deterministic by default. While
365 nondeterministic collations give a more “correct” behavior, especially
366 when considering the full power of Unicode and its many special cases,
367 they also have some drawbacks. Foremost, their use leads to a
368 performance penalty. Note, in particular, that B-tree cannot use
369 deduplication with indexes that use a nondeterministic collation. Also,
370 certain operations are not possible with nondeterministic collations,
371 such as some pattern matching operations. Therefore, they should be
372 used only in cases where they are specifically wanted.
376 To deal with text in different Unicode normalization forms, it is also
377 an option to use the functions/expressions normalize and is normalized
378 to preprocess or check the strings, instead of using nondeterministic
379 collations. There are different trade-offs for each approach.
381 23.2.3. ICU Custom Collations #
383 ICU allows extensive control over collation behavior by defining new
384 collations with collation settings as a part of the language tag. These
385 settings can modify the collation order to suit a variety of needs. For
387 -- ignore differences in accents and case
388 CREATE COLLATION ignore_accent_case (provider = icu, deterministic = false, loca
389 le = 'und-u-ks-level1');
390 SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true
391 SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true
393 -- upper case letters sort before lower case.
394 CREATE COLLATION upper_first (provider = icu, locale = 'und-u-kf-upper');
395 SELECT 'B' < 'b' COLLATE upper_first; -- true
397 -- treat digits numerically and ignore punctuation
398 CREATE COLLATION num_ignore_punct (provider = icu, deterministic = false, locale
399 = 'und-u-ka-shifted-kn');
400 SELECT 'id-45' < 'id-123' COLLATE num_ignore_punct; -- true
401 SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
403 Many of the available options are described in Section 23.2.3.2, or see
404 Section 23.2.3.5 for more details.
406 23.2.3.1. ICU Comparison Levels #
408 Comparison of two strings (collation) in ICU is determined by a
409 multi-level process, where textual features are grouped into "levels".
410 Treatment of each level is controlled by the collation settings. Higher
411 levels correspond to finer textual features.
413 Table 23.1 shows which textual feature differences are considered
414 significant when determining equality at the given level. The Unicode
415 character U+2063 is an invisible separator, and as seen in the table,
416 is ignored for at all levels of comparison less than identic.
418 Table 23.1. ICU Collation Levels
419 Level Description 'f' = 'f' 'ab' = U&'a\2063b' 'x-y' = 'x_y' 'g' = 'G'
421 level1 Base Character true true true true true false
422 level2 Accents true true true true false false
423 level3 Case/Variants true true true false false false
424 level4 Punctuation^[a] true true false false false false
425 identic All true false false false false false
427 ^[a] only with ka-shifted; see Table 23.2
429 At every level, even with full normalization off, basic normalization
430 is performed. For example, 'á' may be composed of the code points
431 U&'\0061\0301' or the single code point U&'\00E1', and those sequences
432 will be considered equal even at the identic level. To treat any
433 difference in code point representation as distinct, use a collation
434 created with deterministic set to true.
436 23.2.3.1.1. Collation Level Examples #
438 CREATE COLLATION level3 (provider = icu, deterministic = false, locale = 'und-u-
439 ka-shifted-ks-level3');
440 CREATE COLLATION level4 (provider = icu, deterministic = false, locale = 'und-u-
441 ka-shifted-ks-level4');
442 CREATE COLLATION identic (provider = icu, deterministic = false, locale = 'und-u
443 -ka-shifted-ks-identic');
445 -- invisible separator ignored at all levels except identic
446 SELECT 'ab' = U&'a\2063b' COLLATE level4; -- true
447 SELECT 'ab' = U&'a\2063b' COLLATE identic; -- false
449 -- punctuation ignored at level3 but not at level 4
450 SELECT 'x-y' = 'x_y' COLLATE level3; -- true
451 SELECT 'x-y' = 'x_y' COLLATE level4; -- false
453 23.2.3.2. Collation Settings for an ICU Locale #
455 Table 23.2 shows the available collation settings, which can be used as
456 part of a language tag to customize a collation.
458 Table 23.2. ICU Collation Settings
459 Key Values Default Description
460 co emoji, phonebk, standard, ... standard Collation type. See
461 Section 23.2.3.5 for additional options and details.
462 ka noignore, shifted noignore If set to shifted, causes some characters
463 (e.g. punctuation or space) to be ignored in comparison. Key ks must be
464 set to level3 or lower to take effect. Set key kv to control which
465 character classes are ignored.
466 kb true, false false Backwards comparison for the level 2 differences.
467 For example, locale und-u-kb sorts 'àe' before 'aé'.
470 Separates case into a "level 2.5" that falls between accents and other
473 If set to true and ks is set to level1, will ignore accents but take
475 kf upper, lower, false false If set to upper, upper case sorts before
476 lower case. If set to lower, lower case sorts before upper case. If set
477 to false, the sort depends on the rules of the locale.
478 kn true, false false If set to true, numbers within a string are
479 treated as a single numeric value rather than a sequence of digits. For
480 example, 'id-45' sorts before 'id-123'.
483 Enable full normalization; may affect performance. Basic normalization
484 is performed even when set to false. Locales for languages that require
485 full normalization typically enable it by default.
487 Full normalization is important in some cases, such as when multiple
488 accents are applied to a single character. For example, the code point
489 sequences U&'\0065\0323\0302' and U&'\0065\0302\0323' represent an e
490 with circumflex and dot-below accents applied in different orders. With
491 full normalization on, these code point sequences are treated as equal;
492 otherwise they are unequal.
493 kr space, punct, symbol, currency, digit, script-id
495 Set to one or more of the valid values, or any BCP 47 script-id, e.g.
496 latn ("Latin") or grek ("Greek"). Multiple values are separated by "-".
498 Redefines the ordering of classes of characters; those characters
499 belonging to a class earlier in the list sort before characters
500 belonging to a class later in the list. For instance, the value
501 digit-currency-space (as part of a language tag like
502 und-u-kr-digit-currency-space) sorts punctuation before digits and
504 ks level1, level2, level3, level4, identic level3 Sensitivity (or
505 "strength") when determining equality, with level1 the least sensitive
506 to differences and identic the most sensitive to differences. See
507 Table 23.1 for details.
508 kv space, punct, symbol, currency punct Classes of characters ignored
509 during comparison at level 3. Setting to a later value includes earlier
510 values; e.g. symbol also includes punct and space in the characters to
511 be ignored. Key ka must be set to shifted and key ks must be set to
512 level3 or lower to take effect.
514 Defaults may depend on locale. The above table is not meant to be
515 complete. See Section 23.2.3.5 for additional options and details.
519 For many collation settings, you must create the collation with
520 deterministic set to false for the setting to have the desired effect
521 (see Section 23.2.2.4). Additionally, some settings only take effect
522 when the key ka is set to shifted (see Table 23.2).
524 23.2.3.3. Collation Settings Examples #
526 CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale =
527 'de-u-co-phonebk'); #
528 German collation with phone book collation type
530 CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale =
532 Root collation with Emoji collation type, per Unicode Technical
535 CREATE COLLATION latinlast (provider = icu, locale =
536 'en-u-kr-grek-latn'); #
537 Sort Greek letters before Latin ones. (The default is Latin
540 CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');
542 Sort upper-case letters before lower-case letters. (The default
543 is lower-case letters first.)
545 CREATE COLLATION special (provider = icu, locale =
546 'en-u-kf-upper-kr-grek-latn'); #
547 Combines both of the above options.
549 23.2.3.4. ICU Tailoring Rules #
551 If the options provided by the collation settings shown above are not
552 sufficient, the order of collation elements can be changed with
553 tailoring rules, whose syntax is detailed at
554 https://unicode-org.github.io/icu/userguide/collation/customization/.
556 This small example creates a collation based on the root locale with a
558 CREATE COLLATION custom (provider = icu, locale = 'und', rules = '&V << w <<< W'
561 With this rule, the letter “W” is sorted after “V”, but is treated as a
562 secondary difference similar to an accent. Rules like this are
563 contained in the locale definitions of some languages. (Of course, if a
564 locale definition already contains the desired rules, then they don't
565 need to be specified again explicitly.)
567 Here is a more complex example. The following statement sets up a
568 collation named ebcdic with rules to sort US-ASCII characters in the
569 order of the EBCDIC encoding.
570 CREATE COLLATION ebcdic (provider = icu, locale = 'und',
572 & ' ' < '.' < '<' < '(' < '+' < \|
573 < '&' < '!' < '$' < '*' < ')' < ';'
574 < '-' < '/' < ',' < '%' < '_' < '>' < '?'
575 < '`' < ':' < '#' < '@' < \' < '=' < '"'
576 <*a-r < '~' <*s-z < '^' < '[' < ']'
577 < '{' <*A-I < '}' <*J-R < '\' <*S-Z <*0-9
581 FROM (VALUES ('a'), ('b'), ('A'), ('B'), ('1'), ('2'), ('!'), ('^')) AS x(c)
582 ORDER BY c COLLATE ebcdic;
594 23.2.3.5. External References for ICU #
596 This section (Section 23.2.3) is only a brief overview of ICU behavior
597 and language tags. Refer to the following documents for technical
598 details, additional options, and new behavior:
599 * Unicode Technical Standard #35
602 * https://unicode-org.github.io/icu/userguide/locale/
603 * https://unicode-org.github.io/icu/userguide/collation/