<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>F.48. unaccent — a text search dictionary which removes diacritics</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="tsm-system-time.html" title="F.47. tsm_system_time — the SYSTEM_TIME sampling method for TABLESAMPLE" /><link rel="next" href="uuid-ossp.html" title="F.49. uuid-ossp — a UUID generator" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">F.48. unaccent — a text search dictionary which removes diacritics</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="tsm-system-time.html" title="F.47. tsm_system_time —&#10;   the SYSTEM_TIME sampling method for TABLESAMPLE">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="contrib.html" title="Appendix F. Additional Supplied Modules and Extensions">Up</a></td><th width="60%" align="center">Appendix F. Additional Supplied Modules and Extensions</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="uuid-ossp.html" title="F.49. uuid-ossp — a UUID generator">Next</a></td></tr></table><hr /></div><div class="sect1" id="UNACCENT"><div class="titlepage"><div><div><h2 class="title" style="clear: both">F.48. unaccent — a text search dictionary which removes diacritics <a href="#UNACCENT" class="id_link">#</a></h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="unaccent.html#UNACCENT-CONFIGURATION">F.48.1. 
Configuration</a></span></dt><dt><span class="sect2"><a href="unaccent.html#UNACCENT-USAGE">F.48.2. Usage</a></span></dt><dt><span class="sect2"><a href="unaccent.html#UNACCENT-FUNCTIONS">F.48.3. Functions</a></span></dt></dl></div><a id="id-1.11.7.58.2" class="indexterm"></a><p>
  <code class="filename">unaccent</code> is a text search dictionary that removes accents
  (diacritic signs) from lexemes.
  It's a filtering dictionary, which means its output is
  always passed to the next dictionary (if any), unlike the normal
  behavior of dictionaries.  This allows accent-insensitive processing
  for full text search.
 </p><p>
  The current implementation of <code class="filename">unaccent</code> cannot be used as a
  normalizing dictionary for the <code class="filename">thesaurus</code> dictionary.
 </p><p>
  This module is considered <span class="quote">“<span class="quote">trusted</span>”</span>, that is, it can be
  installed by non-superusers who have <code class="literal">CREATE</code> privilege
  on the current database.
 </p><div class="sect2" id="UNACCENT-CONFIGURATION"><div class="titlepage"><div><div><h3 class="title">F.48.1. Configuration <a href="#UNACCENT-CONFIGURATION" class="id_link">#</a></h3></div></div></div><p>
   An <code class="literal">unaccent</code> dictionary accepts the following options:
  </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
     <code class="literal">RULES</code> is the base name of the file containing the list of
     translation rules.  This file must be stored in
     <code class="filename">$SHAREDIR/tsearch_data/</code> (where <code class="literal">$SHAREDIR</code> means
     the <span class="productname">PostgreSQL</span> installation's shared-data directory).
     Its name must end in <code class="literal">.rules</code> (which is not to be included in
     the <code class="literal">RULES</code> parameter).
    </p></li></ul></div><p>
   The rules file has the following format:
  </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
     Each line represents one translation rule, consisting of an accented
     character followed by its unaccented replacement.  The first is translated
     into the second.  For example,
</p><pre class="programlisting">
À        A
Á        A
Â        A
Ã        A
Ä        A
Å        A
Æ        AE
</pre><p>
     The two characters must be separated by whitespace, and any leading or
     trailing whitespace on a line is ignored.
    </p></li><li class="listitem"><p>
     Alternatively, if only one character is given on a line, instances of
     that character are deleted; this is useful in languages where accents
     are represented by separate characters.
    </p></li><li class="listitem"><p>
     In fact, each <span class="quote">“<span class="quote">character</span>”</span> can be any string not containing
     whitespace, so <code class="filename">unaccent</code> dictionaries could be used for
     other sorts of substring substitutions besides diacritic removal.
    </p></li><li class="listitem"><p>
     The translation of some characters, such as numeric symbols, may need to
     contain whitespace.  In that case, enclose the translated characters in
     double quotes.  To include a double quote in the translation, escape it
     with a second double quote.  For example:
</p><pre class="programlisting">
¼      " 1/4"
½      " 1/2"
¾      " 3/4"
“       """"
”       """"
</pre><p>
    </p></li><li class="listitem"><p>
     As with other <span class="productname">PostgreSQL</span> text search configuration files,
     the rules file must be stored in UTF-8 encoding.  The data is
     automatically translated into the current database's encoding when
     loaded.  Any lines containing untranslatable characters are silently
     ignored, so that rules files can contain rules that are not applicable in
     the current encoding.
    </p></li></ul></div><p>
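   To illustrate the format, a hypothetical rules file
   <code class="filename">my_rules.rules</code> (an assumed name, not a file shipped with the
   module) might combine an ordinary translation, a translation to a longer
   string, a quoted translation containing whitespace, and a deletion rule
   consisting of a single character on its line:
</p><pre class="programlisting">
é        e
ß        ss
€        " EUR"
ˈ
</pre><p>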
   A more complete example, which is directly useful for most European
   languages, can be found in <code class="filename">unaccent.rules</code>, which is installed
   in <code class="filename">$SHAREDIR/tsearch_data/</code> when the <code class="filename">unaccent</code>
   module is installed.  This rules file translates characters with accents
   to the same characters without accents, and it also expands ligatures
   into the equivalent series of simple characters (for example, Æ to
   AE).
  </p></div><div class="sect2" id="UNACCENT-USAGE"><div class="titlepage"><div><div><h3 class="title">F.48.2. Usage <a href="#UNACCENT-USAGE" class="id_link">#</a></h3></div></div></div><p>
   Installing the <code class="literal">unaccent</code> extension creates a text
   search template <code class="literal">unaccent</code> and a dictionary <code class="literal">unaccent</code>
   based on it.  The <code class="literal">unaccent</code> dictionary has the default
   parameter setting <code class="literal">RULES='unaccent'</code>, which makes it immediately
   usable with the standard <code class="filename">unaccent.rules</code> file.
   If you wish, you can alter the parameter, for example

</p><pre class="programlisting">
mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
</pre><p>

   or create new dictionaries based on the template.
  </p><p>
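   As a minimal sketch of creating a new dictionary from the template, the
   following leaves the stock <code class="literal">unaccent</code> dictionary untouched.
   The names <code class="literal">my_unaccent</code> and <code class="literal">my_rules</code> are
   hypothetical, and a file <code class="filename">my_rules.rules</code> is assumed to
   exist in <code class="filename">$SHAREDIR/tsearch_data/</code>:
</p><pre class="programlisting">
mydb=# CREATE TEXT SEARCH DICTIONARY my_unaccent (
        TEMPLATE = unaccent,
        RULES = 'my_rules'
       );
</pre><p>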
   To test the dictionary, you can try:
</p><pre class="programlisting">
mydb=# SELECT ts_lexize('unaccent','Hôtel');
 ts_lexize
-----------
 {Hotel}
(1 row)
</pre><p>
  </p><p>
   Here is an example showing how to insert the
   <code class="filename">unaccent</code> dictionary into a text search configuration:
</p><pre class="programlisting">
mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
mydb=# ALTER TEXT SEARCH CONFIGURATION fr
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, french_stem;
mydb=# SELECT to_tsvector('fr','Hôtels de la Mer');
    to_tsvector
-------------------
 'hotel':1 'mer':4
(1 row)

mydb=# SELECT to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels');
 ?column?
----------
 t
(1 row)

mydb=# SELECT ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));
      ts_headline
------------------------
 &lt;b&gt;Hôtel&lt;/b&gt; de la Mer
(1 row)
</pre><p>
  </p></div><div class="sect2" id="UNACCENT-FUNCTIONS"><div class="titlepage"><div><div><h3 class="title">F.48.3. Functions <a href="#UNACCENT-FUNCTIONS" class="id_link">#</a></h3></div></div></div><p>
  The <code class="function">unaccent()</code> function removes accents (diacritic signs) from
  a given string.  Basically, it's a wrapper around
  <code class="filename">unaccent</code>-type dictionaries, but it can be used outside normal
  text search contexts.
 </p><a id="id-1.11.7.58.8.3" class="indexterm"></a><pre class="synopsis">
unaccent([<span class="optional"><em class="replaceable"><code>dictionary</code></em> <code class="type">regdictionary</code>, </span>] <em class="replaceable"><code>string</code></em> <code class="type">text</code>) returns <code class="type">text</code>
</pre><p>
  If the <em class="replaceable"><code>dictionary</code></em> argument is
  omitted, the text search dictionary named <code class="literal">unaccent</code> and
  appearing in the same schema as the <code class="function">unaccent()</code>
  function itself is used.
 </p><p>
  For example:
</p><pre class="programlisting">
SELECT unaccent('unaccent', 'Hôtel');
SELECT unaccent('Hôtel');
</pre><p>
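  Outside text search, one possible use is accent-insensitive matching in
  an ordinary query.  The following is only a sketch; the table
  <code class="literal">places</code> and its column <code class="literal">name</code> are
  hypothetical:
</p><pre class="programlisting">
SELECT name
  FROM places
 WHERE lower(unaccent(name)) = lower(unaccent('Hôtel de la Mer'));
</pre><p>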
 </p></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="tsm-system-time.html" title="F.47. tsm_system_time —&#10;   the SYSTEM_TIME sampling method for TABLESAMPLE">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="contrib.html" title="Appendix F. Additional Supplied Modules and Extensions">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="uuid-ossp.html" title="F.49. uuid-ossp — a UUID generator">Next</a></td></tr><tr><td width="40%" align="left" valign="top">F.47. tsm_system_time —
   the <code class="literal">SYSTEM_TIME</code> sampling method for <code class="literal">TABLESAMPLE</code> </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="40%" align="right" valign="top"> F.49. uuid-ossp — a UUID generator</td></tr></table></div></body></html>