begriffs open source - ai-pg/blob - full-docs/src/sgml/html/btree.html

   1 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
   2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>65.1. B-Tree Indexes</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="indextypes.html" title="Chapter 65. Built-in Index Access Methods" /><link rel="next" href="gist.html" title="65.2. GiST Indexes" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">65.1. B-Tree Indexes</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="indextypes.html" title="Chapter 65. Built-in Index Access Methods">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="indextypes.html" title="Chapter 65. Built-in Index Access Methods">Up</a></td><th width="60%" align="center">Chapter 65. Built-in Index Access Methods</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="gist.html" title="65.2. GiST Indexes">Next</a></td></tr></table><hr /></div><div class="sect1" id="BTREE"><div class="titlepage"><div><div><h2 class="title" style="clear: both">65.1. B-Tree Indexes <a href="#BTREE" class="id_link">#</a></h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="btree.html#BTREE-INTRO">65.1.1. Introduction</a></span></dt><dt><span class="sect2"><a href="btree.html#BTREE-BEHAVIOR">65.1.2. Behavior of B-Tree Operator Classes</a></span></dt><dt><span class="sect2"><a href="btree.html#BTREE-SUPPORT-FUNCS">65.1.3. B-Tree Support Functions</a></span></dt><dt><span class="sect2"><a href="btree.html#BTREE-IMPLEMENTATION">65.1.4. Implementation</a></span></dt></dl></div><a id="id-1.10.17.2.2" class="indexterm"></a><div class="sect2" id="BTREE-INTRO"><div class="titlepage"><div><div><h3 class="title">65.1.1. Introduction <a href="#BTREE-INTRO" class="id_link">#</a></h3></div></div></div><p>
   3   <span class="productname">PostgreSQL</span> includes an implementation of the
   4   standard <acronym class="acronym">btree</acronym> (multi-way balanced tree) index data
   5   structure.  Any data type that can be sorted into a well-defined linear
   6   order can be indexed by a btree index.  The only limitation is that an
   7   index entry cannot exceed approximately one-third of a page (after TOAST
   8   compression, if applicable).
   9  </p><p>
  10   Because each btree operator class imposes a sort order on its data type,
  11   btree operator classes (or, really, operator families) have come to be
  12   used as <span class="productname">PostgreSQL</span>'s general representation
  13   and understanding of sorting semantics.  Therefore, they've acquired
  14   some features that go beyond what would be needed just to support btree
  15   indexes, and parts of the system that are quite distant from the
  16   btree <acronym class="acronym">AM</acronym> make use of them.
  17  </p></div><div class="sect2" id="BTREE-BEHAVIOR"><div class="titlepage"><div><div><h3 class="title">65.1.2. Behavior of B-Tree Operator Classes <a href="#BTREE-BEHAVIOR" class="id_link">#</a></h3></div></div></div><p>
  18   As shown in <a class="xref" href="xindex.html#XINDEX-BTREE-STRAT-TABLE" title="Table 36.3. B-Tree Strategies">Table 36.3</a>, a btree operator
  19   class must provide five comparison operators,
  20   <code class="literal">&lt;</code>,
  21   <code class="literal">&lt;=</code>,
  22   <code class="literal">=</code>,
  23   <code class="literal">&gt;=</code> and
  24   <code class="literal">&gt;</code>.
  25   One might expect that <code class="literal">&lt;&gt;</code> should also be part of
  26   the operator class, but it is not, because it would almost never be
  27   useful to use a <code class="literal">&lt;&gt;</code> WHERE clause in an index
  28   search.  (For some purposes, the planner treats <code class="literal">&lt;&gt;</code>
  29   as associated with a btree operator class; but it finds that operator via
  30   the <code class="literal">=</code> operator's negator link, rather than
  31   from <code class="structname">pg_amop</code>.)
  32  </p><p>
  33   When several data types share near-identical sorting semantics, their
  34   operator classes can be grouped into an operator family.  Doing so is
  35   advantageous because it allows the planner to make deductions about
  36   cross-type comparisons.  Each operator class within the family should
  37   contain the single-type operators (and associated support functions)
  38   for its input data type, while cross-type comparison operators and
  39   support functions are <span class="quote">“<span class="quote">loose</span>”</span> in the family.  It is
  40   recommendable that a complete set of cross-type operators be included
  41   in the family, thus ensuring that the planner can represent any
  42   comparison conditions that it deduces from transitivity.
  43  </p><p>
  44   There are some basic assumptions that a btree operator family must
  45   satisfy:
  46  </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
  47     An <code class="literal">=</code> operator must be an equivalence relation; that
  48     is, for all non-null values <em class="replaceable"><code>A</code></em>,
  49     <em class="replaceable"><code>B</code></em>, <em class="replaceable"><code>C</code></em> of the
  50     data type:
  51
  52     </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; "><li class="listitem"><p>
  53        <em class="replaceable"><code>A</code></em> <code class="literal">=</code>
  54        <em class="replaceable"><code>A</code></em> is true
  55        (<em class="firstterm">reflexive law</em>)
  56       </p></li><li class="listitem"><p>
  57        if <em class="replaceable"><code>A</code></em> <code class="literal">=</code>
  58        <em class="replaceable"><code>B</code></em>,
  59        then <em class="replaceable"><code>B</code></em> <code class="literal">=</code>
  60        <em class="replaceable"><code>A</code></em>
  61        (<em class="firstterm">symmetric law</em>)
  62       </p></li><li class="listitem"><p>
  63        if <em class="replaceable"><code>A</code></em> <code class="literal">=</code>
  64        <em class="replaceable"><code>B</code></em> and <em class="replaceable"><code>B</code></em>
  65        <code class="literal">=</code> <em class="replaceable"><code>C</code></em>,
  66        then <em class="replaceable"><code>A</code></em> <code class="literal">=</code>
  67        <em class="replaceable"><code>C</code></em>
  68        (<em class="firstterm">transitive law</em>)
  69       </p></li></ul></div><p>
  70    </p></li><li class="listitem"><p>
  71     A <code class="literal">&lt;</code> operator must be a strong ordering relation;
  72     that is, for all non-null values <em class="replaceable"><code>A</code></em>,
  73     <em class="replaceable"><code>B</code></em>, <em class="replaceable"><code>C</code></em>:
  74
  75     </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; "><li class="listitem"><p>
  76        <em class="replaceable"><code>A</code></em> <code class="literal">&lt;</code>
  77        <em class="replaceable"><code>A</code></em> is false
  78        (<em class="firstterm">irreflexive law</em>)
  79       </p></li><li class="listitem"><p>
  80        if <em class="replaceable"><code>A</code></em> <code class="literal">&lt;</code>
  81        <em class="replaceable"><code>B</code></em>
  82        and <em class="replaceable"><code>B</code></em> <code class="literal">&lt;</code>
  83        <em class="replaceable"><code>C</code></em>,
  84        then <em class="replaceable"><code>A</code></em> <code class="literal">&lt;</code>
  85        <em class="replaceable"><code>C</code></em>
  86        (<em class="firstterm">transitive law</em>)
  87       </p></li></ul></div><p>
  88    </p></li><li class="listitem"><p>
  89     Furthermore, the ordering is total; that is, for all non-null
  90     values <em class="replaceable"><code>A</code></em>, <em class="replaceable"><code>B</code></em>:
  91
  92     </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; "><li class="listitem"><p>
  93        exactly one of <em class="replaceable"><code>A</code></em> <code class="literal">&lt;</code>
  94        <em class="replaceable"><code>B</code></em>, <em class="replaceable"><code>A</code></em>
  95        <code class="literal">=</code> <em class="replaceable"><code>B</code></em>, and
  96        <em class="replaceable"><code>B</code></em> <code class="literal">&lt;</code>
  97        <em class="replaceable"><code>A</code></em> is true
  98        (<em class="firstterm">trichotomy law</em>)
  99       </p></li></ul></div><p>
 100
 101     (The trichotomy law justifies the definition of the comparison support
 102     function, of course.)
 103    </p></li></ul></div><p>
 104   The other three operators are defined in terms of <code class="literal">=</code>
 105   and <code class="literal">&lt;</code> in the obvious way, and must act consistently
 106   with them.
 107  </p><p>
 108   For an operator family supporting multiple data types, the above laws must
 109   hold when <em class="replaceable"><code>A</code></em>, <em class="replaceable"><code>B</code></em>,
 110   <em class="replaceable"><code>C</code></em> are taken from any data types in the family.
 111   The transitive laws are the trickiest to ensure, as in cross-type
 112   situations they represent statements that the behaviors of two or three
 113   different operators are consistent.
 114   As an example, it would not work to put <code class="type">float8</code>
 115   and <code class="type">numeric</code> into the same operator family, at least not with
 116   the current semantics that <code class="type">numeric</code> values are converted
 117   to <code class="type">float8</code> for comparison to a <code class="type">float8</code>.  Because
 118   of the limited accuracy of <code class="type">float8</code>, this means there are
 119   distinct <code class="type">numeric</code> values that will compare equal to the
 120   same <code class="type">float8</code> value, and thus the transitive law would fail.
 121  </p><p>
 122   Another requirement for a multiple-data-type family is that any implicit
 123   or binary-coercion casts that are defined between data types included in
 124   the operator family must not change the associated sort ordering.
 125  </p><p>
 126   It should be fairly clear why a btree index requires these laws to hold
 127   within a single data type: without them there is no ordering to arrange
 128   the keys with.  Also, index searches using a comparison key of a
 129   different data type require comparisons to behave sanely across two
 130   data types.  The extensions to three or more data types within a family
 131   are not strictly required by the btree index mechanism itself, but the
 132   planner relies on them for optimization purposes.
 133  </p></div><div class="sect2" id="BTREE-SUPPORT-FUNCS"><div class="titlepage"><div><div><h3 class="title">65.1.3. B-Tree Support Functions <a href="#BTREE-SUPPORT-FUNCS" class="id_link">#</a></h3></div></div></div><p>
 134   As shown in <a class="xref" href="xindex.html#XINDEX-BTREE-SUPPORT-TABLE" title="Table 36.9. B-Tree Support Functions">Table 36.9</a>, btree defines
 135   one required and five optional support functions.  The six
 136   user-defined methods are:
 137  </p><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="function">order</code></span></dt><dd><p>
 138      For each combination of data types that a btree operator family
 139      provides comparison operators for, it must provide a comparison
 140      support function, registered in
 141      <code class="structname">pg_amproc</code> with support function number 1
 142      and
 143      <code class="structfield">amproclefttype</code>/<code class="structfield">amprocrighttype</code>
 144      equal to the left and right data types for the comparison (i.e.,
 145      the same data types that the matching operators are registered
 146      with in <code class="structname">pg_amop</code>).  The comparison
 147      function must take two non-null values
 148      <em class="replaceable"><code>A</code></em> and <em class="replaceable"><code>B</code></em> and
 149      return an <code class="type">int32</code> value that is
 150      <code class="literal">&lt;</code> <code class="literal">0</code>,
 151      <code class="literal">0</code>, or <code class="literal">&gt;</code>
 152      <code class="literal">0</code> when <em class="replaceable"><code>A</code></em>
 153      <code class="literal">&lt;</code> <em class="replaceable"><code>B</code></em>,
 154      <em class="replaceable"><code>A</code></em> <code class="literal">=</code>
 155      <em class="replaceable"><code>B</code></em>, or <em class="replaceable"><code>A</code></em>
 156      <code class="literal">&gt;</code> <em class="replaceable"><code>B</code></em>,
 157      respectively.  A null result is disallowed: all values of the
 158      data type must be comparable.  See
 159      <code class="filename">src/backend/access/nbtree/nbtcompare.c</code> for
 160      examples.
 161     </p><p>
 162      If the compared values are of a collatable data type, the
 163      appropriate collation OID will be passed to the comparison
 164      support function, using the standard
 165      <code class="function">PG_GET_COLLATION()</code> mechanism.
 166     </p></dd><dt><span class="term"><code class="function">sortsupport</code></span></dt><dd><p>
 167      Optionally, a btree operator family may provide <em class="firstterm">sort
 168       support</em> function(s), registered under support
 169      function number 2.  These functions allow implementing
 170      comparisons for sorting purposes in a more efficient way than
 171      naively calling the comparison support function.  The APIs
 172      involved in this are defined in
 173      <code class="filename">src/include/utils/sortsupport.h</code>.
 174     </p></dd><dt><span class="term"><code class="function">in_range</code></span></dt><dd><a id="id-1.10.17.2.5.3.3.2.1" class="indexterm"></a><a id="id-1.10.17.2.5.3.3.2.2" class="indexterm"></a><p>
 175      Optionally, a btree operator family may provide
 176      <em class="firstterm">in_range</em> support function(s), registered
 177      under support function number 3.  These are not used during btree
 178      index operations; rather, they extend the semantics of the
 179      operator family so that it can support window clauses containing
 180      the <code class="literal">RANGE</code> <em class="replaceable"><code>offset</code></em>
 181      <code class="literal">PRECEDING</code> and <code class="literal">RANGE</code>
 182      <em class="replaceable"><code>offset</code></em> <code class="literal">FOLLOWING</code>
 183      frame bound types (see <a class="xref" href="sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS" title="4.2.8. Window Function Calls">Section 4.2.8</a>).  Fundamentally, the extra
 184      information provided is how to add or subtract an
 185      <em class="replaceable"><code>offset</code></em> value in a way that is
 186      compatible with the family's data ordering.
 187     </p><p>
 188      An <code class="function">in_range</code> function must have the signature
 189 </p><pre class="synopsis">
 190 in_range(<em class="replaceable"><code>val</code></em> type1, <em class="replaceable"><code>base</code></em> type1, <em class="replaceable"><code>offset</code></em> type2, <em class="replaceable"><code>sub</code></em> bool, <em class="replaceable"><code>less</code></em> bool)
 191 returns bool
 192 </pre><p>
 193      <em class="replaceable"><code>val</code></em> and
 194      <em class="replaceable"><code>base</code></em> must be of the same type, which
 195      is one of the types supported by the operator family (i.e., a
 196      type for which it provides an ordering).  However,
 197      <em class="replaceable"><code>offset</code></em> could be of a different type,
 198      which might be one otherwise unsupported by the family.  An
 199      example is that the built-in <code class="literal">time_ops</code> family
 200      provides an <code class="function">in_range</code> function that has
 201      <em class="replaceable"><code>offset</code></em> of type <code class="type">interval</code>.
 202      A family can provide <code class="function">in_range</code> functions for
 203      any of its supported types and one or more
 204      <em class="replaceable"><code>offset</code></em> types.  Each
 205      <code class="function">in_range</code> function should be entered in
 206      <code class="structname">pg_amproc</code> with
 207      <code class="structfield">amproclefttype</code> equal to
 208      <code class="type">type1</code> and <code class="structfield">amprocrighttype</code>
 209      equal to <code class="type">type2</code>.
 210     </p><p>
 211      The essential semantics of an <code class="function">in_range</code>
 212      function depend on the two Boolean flag parameters.  It should
 213      add or subtract <em class="replaceable"><code>base</code></em> and
 214      <em class="replaceable"><code>offset</code></em>, then compare
 215      <em class="replaceable"><code>val</code></em> to the result, as follows:
 216      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
 217         if <code class="literal">!</code><em class="replaceable"><code>sub</code></em> and
 218         <code class="literal">!</code><em class="replaceable"><code>less</code></em>, return
 219         <em class="replaceable"><code>val</code></em> <code class="literal">&gt;=</code>
 220         (<em class="replaceable"><code>base</code></em> <code class="literal">+</code>
 221         <em class="replaceable"><code>offset</code></em>)
 222        </p></li><li class="listitem"><p>
 223         if <code class="literal">!</code><em class="replaceable"><code>sub</code></em> and
 224         <em class="replaceable"><code>less</code></em>, return
 225         <em class="replaceable"><code>val</code></em> <code class="literal">&lt;=</code>
 226         (<em class="replaceable"><code>base</code></em> <code class="literal">+</code>
 227         <em class="replaceable"><code>offset</code></em>)
 228        </p></li><li class="listitem"><p>
 229         if <em class="replaceable"><code>sub</code></em> and
 230         <code class="literal">!</code><em class="replaceable"><code>less</code></em>, return
 231         <em class="replaceable"><code>val</code></em> <code class="literal">&gt;=</code>
 232         (<em class="replaceable"><code>base</code></em> <code class="literal">-</code>
 233         <em class="replaceable"><code>offset</code></em>)
 234        </p></li><li class="listitem"><p>
 235         if <em class="replaceable"><code>sub</code></em> and
 236         <em class="replaceable"><code>less</code></em>, return
 237         <em class="replaceable"><code>val</code></em> <code class="literal">&lt;=</code>
 238         (<em class="replaceable"><code>base</code></em> <code class="literal">-</code>
 239         <em class="replaceable"><code>offset</code></em>)
 240        </p></li></ul></div><p>
 241      Before doing so, the function should check the sign of
 242      <em class="replaceable"><code>offset</code></em>: if it is less than zero, raise
 243      error
 244      <code class="literal">ERRCODE_INVALID_PRECEDING_OR_FOLLOWING_SIZE</code>
 245      (22013) with error text like <span class="quote">“<span class="quote">invalid preceding or
 246       following size in window function</span>”</span>.  (This is required by
 247      the SQL standard, although nonstandard operator families might
 248      perhaps choose to ignore this restriction, since there seems to
 249      be little semantic necessity for it.) This requirement is
 250      delegated to the <code class="function">in_range</code> function so that
 251      the core code needn't understand what <span class="quote">“<span class="quote">less than
 252       zero</span>”</span> means for a particular data type.
 253     </p><p>
 254      An additional expectation is that <code class="function">in_range</code>
 255      functions should, if practical, avoid throwing an error if
 256      <em class="replaceable"><code>base</code></em> <code class="literal">+</code>
 257      <em class="replaceable"><code>offset</code></em> or
 258      <em class="replaceable"><code>base</code></em> <code class="literal">-</code>
 259      <em class="replaceable"><code>offset</code></em> would overflow.  The correct
 260      comparison result can be determined even if that value would be
 261      out of the data type's range.  Note that if the data type
 262      includes concepts such as <span class="quote">“<span class="quote">infinity</span>”</span> or
 263      <span class="quote">“<span class="quote">NaN</span>”</span>, extra care may be needed to ensure that
 264      <code class="function">in_range</code>'s results agree with the normal
 265      sort order of the operator family.
 266     </p><p>
 267      The results of the <code class="function">in_range</code> function must be
 268      consistent with the sort ordering imposed by the operator family.
 269      To be precise, given any fixed values of
 270      <em class="replaceable"><code>offset</code></em> and
 271      <em class="replaceable"><code>sub</code></em>, then:
 272      </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
 273         If <code class="function">in_range</code> with
 274         <em class="replaceable"><code>less</code></em> = true is true for some
 275         <em class="replaceable"><code>val1</code></em> and
 276         <em class="replaceable"><code>base</code></em>, it must be true for every
 277         <em class="replaceable"><code>val2</code></em> <code class="literal">&lt;=</code>
 278         <em class="replaceable"><code>val1</code></em> with the same
 279         <em class="replaceable"><code>base</code></em>.
 280        </p></li><li class="listitem"><p>
 281         If <code class="function">in_range</code> with
 282         <em class="replaceable"><code>less</code></em> = true is false for some
 283         <em class="replaceable"><code>val1</code></em> and
 284         <em class="replaceable"><code>base</code></em>, it must be false for every
 285         <em class="replaceable"><code>val2</code></em> <code class="literal">&gt;=</code>
 286         <em class="replaceable"><code>val1</code></em> with the same
 287         <em class="replaceable"><code>base</code></em>.
 288        </p></li><li class="listitem"><p>
 289         If <code class="function">in_range</code> with
 290         <em class="replaceable"><code>less</code></em> = true is true for some
 291         <em class="replaceable"><code>val</code></em> and
 292         <em class="replaceable"><code>base1</code></em>, it must be true for every
 293         <em class="replaceable"><code>base2</code></em> <code class="literal">&gt;=</code>
 294         <em class="replaceable"><code>base1</code></em> with the same
 295         <em class="replaceable"><code>val</code></em>.
 296        </p></li><li class="listitem"><p>
 297         If <code class="function">in_range</code> with
 298         <em class="replaceable"><code>less</code></em> = true is false for some
 299         <em class="replaceable"><code>val</code></em> and
 300         <em class="replaceable"><code>base1</code></em>, it must be false for every
 301         <em class="replaceable"><code>base2</code></em> <code class="literal">&lt;=</code>
 302         <em class="replaceable"><code>base1</code></em> with the same
 303         <em class="replaceable"><code>val</code></em>.
 304        </p></li></ul></div><p>
 305      Analogous statements with inverted conditions hold when
 306      <em class="replaceable"><code>less</code></em> = false.
 307     </p><p>
 308      If the type being ordered (<code class="type">type1</code>) is collatable, the
 309      appropriate collation OID will be passed to the
 310      <code class="function">in_range</code> function, using the standard
 311      PG_GET_COLLATION() mechanism.
 312     </p><p>
 313      <code class="function">in_range</code> functions need not handle NULL
 314      inputs, and typically will be marked strict.
 315     </p></dd><dt><span class="term"><code class="function">equalimage</code></span></dt><dd><p>
 316      Optionally, a btree operator family may provide
 317      <code class="function">equalimage</code> (<span class="quote">“<span class="quote">equality implies image
 318       equality</span>”</span>) support functions, registered under support
 319      function number 4.  These functions allow the core code to
 320      determine when it is safe to apply the btree deduplication
 321      optimization.  Currently, <code class="function">equalimage</code>
 322      functions are only called when building or rebuilding an index.
 323     </p><p>
 324      An <code class="function">equalimage</code> function must have the
 325      signature
 326 </p><pre class="synopsis">
 327 equalimage(<em class="replaceable"><code>opcintype</code></em> <code class="type">oid</code>) returns bool
 328 </pre><p>
 329      The return value is static information about an operator class
 330      and collation.  Returning <code class="literal">true</code> indicates that
 331      the <code class="function">order</code> function for the operator class is
 332      guaranteed to only return <code class="literal">0</code> (<span class="quote">“<span class="quote">arguments
 333       are equal</span>”</span>) when its <em class="replaceable"><code>A</code></em> and
 334      <em class="replaceable"><code>B</code></em> arguments are also interchangeable
 335      without any loss of semantic information.  Not registering an
 336      <code class="function">equalimage</code> function or returning
 337      <code class="literal">false</code> indicates that this condition cannot be
 338      assumed to hold.
 339     </p><p>
 340      The <em class="replaceable"><code>opcintype</code></em> argument is the
 341      <code class="literal"><code class="structname">pg_type</code>.oid</code> of the
 342      data type that the operator class indexes.  This is a convenience
 343      that allows reuse of the same underlying
 344      <code class="function">equalimage</code> function across operator classes.
 345      If <em class="replaceable"><code>opcintype</code></em> is a collatable data
 346      type, the appropriate collation OID will be passed to the
 347      <code class="function">equalimage</code> function, using the standard
 348      <code class="function">PG_GET_COLLATION()</code> mechanism.
 349     </p><p>
 350      As far as the operator class is concerned, returning
 351      <code class="literal">true</code> indicates that deduplication is safe (or
 352      safe for the collation whose OID was passed to its
 353      <code class="function">equalimage</code> function).  However, the core
 354      code will only deem deduplication safe for an index when
 355      <span class="emphasis"><em>every</em></span> indexed column uses an operator class
 356      that registers an <code class="function">equalimage</code> function, and
 357      each function actually returns <code class="literal">true</code> when
 358      called.
 359     </p><p>
 360      Image equality is <span class="emphasis"><em>almost</em></span> the same condition
 361      as simple bitwise equality.  There is one subtle difference: When
 362      indexing a varlena data type, the on-disk representation of two
 363      image equal datums may not be bitwise equal due to inconsistent
 364      application of <acronym class="acronym">TOAST</acronym> compression on input.
 365      Formally, when an operator class's
 366      <code class="function">equalimage</code> function returns
 367      <code class="literal">true</code>, it is safe to assume that the
 368      <code class="literal">datum_image_eq()</code> C function will always agree
 369      with the operator class's <code class="function">order</code> function
 370      (provided that the same collation OID is passed to both the
 371      <code class="function">equalimage</code> and <code class="function">order</code>
 372      functions).
 373     </p><p>
 374      The core code is fundamentally unable to deduce anything about
 375      the <span class="quote">“<span class="quote">equality implies image equality</span>”</span> status of an
 376      operator class within a multiple-data-type family based on
 377      details from other operator classes in the same family.  Also, it
 378      is not sensible for an operator family to register a cross-type
 379      <code class="function">equalimage</code> function, and attempting to do so
 380      will result in an error.  This is because <span class="quote">“<span class="quote">equality implies
 381       image equality</span>”</span> status does not just depend on
 382      sorting/equality semantics, which are more or less defined at the
 383      operator family level.  In general, the semantics that one
 384      particular data type implements must be considered separately.
 385     </p><p>
 386      The convention followed by the operator classes included with the
 387      core <span class="productname">PostgreSQL</span> distribution is to
 388      register a stock, generic <code class="function">equalimage</code>
 389      function.  Most operator classes register
 390      <code class="function">btequalimage()</code>, which indicates that
 391      deduplication is safe unconditionally.  Operator classes for
 392      collatable data types such as <code class="type">text</code> register
 393      <code class="function">btvarstrequalimage()</code>, which indicates that
 394      deduplication is safe with deterministic collations.  Best
 395      practice for third-party extensions is to register their own
 396      custom function to retain control.
 397     </p></dd><dt><span class="term"><code class="function">options</code></span></dt><dd><p>
 398      Optionally, a B-tree operator family may provide
 399      <code class="function">options</code> (<span class="quote">“<span class="quote">operator class specific
 400      options</span>”</span>) support functions, registered under support
 401      function number 5.  These functions define a set of user-visible
 402      parameters that control operator class behavior.
 403     </p><p>
 404      An <code class="function">options</code> support function must have the
 405      signature
 406 </p><pre class="synopsis">
 407 options(<em class="replaceable"><code>relopts</code></em> <code class="type">local_relopts *</code>) returns void
 408 </pre><p>
 409      The function is passed a pointer to a <code class="structname">local_relopts</code>
 410      struct, which needs to be filled with a set of operator class
 411      specific options.  The options can be accessed from other support
 412      functions using the <code class="literal">PG_HAS_OPCLASS_OPTIONS()</code> and
 413      <code class="literal">PG_GET_OPCLASS_OPTIONS()</code> macros.
 414     </p><p>
 415      Currently, no B-Tree operator class has an <code class="function">options</code>
 416      support function.  B-tree doesn't allow flexible representation of keys
 417      like GiST, SP-GiST, GIN and BRIN do.  So, <code class="function">options</code>
 418      probably doesn't have much application in the current B-tree index
 419      access method.  Nevertheless, this support function was added to B-tree
 420      for uniformity, and will probably find uses during further
 421      evolution of B-tree in <span class="productname">PostgreSQL</span>.
 422     </p></dd><dt><span class="term"><code class="function">skipsupport</code></span></dt><dd><p>
 423      Optionally, a btree operator family may provide a <em class="firstterm">skip
 424       support</em> function, registered under support function number 6.
 425      These functions give the B-tree code a way to iterate through every
 426      possible value that can be represented by an operator class's underlying
 427      input type, in key space order.  This is used by the core code when it
 428      applies the skip scan optimization.  The APIs involved in this are
 429      defined in <code class="filename">src/include/utils/skipsupport.h</code>.
 430     </p><p>
 431      Operator classes that do not provide a skip support function are still
 432      eligible to use skip scan.  The core code can still use its fallback
 433      strategy, though that might be suboptimal for some discrete types.  It
 434      usually doesn't make sense (and may not even be feasible) for operator
 435      classes on continuous types to provide a skip support function.
 436     </p><p>
 437      It is not sensible for an operator family to register a cross-type
 438      <code class="function">skipsupport</code> function, and attempting to do so will
 439      result in an error.  This is because determining the next indexable value
 440      must happen by incrementing a value copied from an index tuple.  The
 441      values generated must all be of the same underlying data type (the
 442      <span class="quote">“<span class="quote">skipped</span>”</span> index column's opclass input type).
 443     </p></dd></dl></div></div><div class="sect2" id="BTREE-IMPLEMENTATION"><div class="titlepage"><div><div><h3 class="title">65.1.4. Implementation <a href="#BTREE-IMPLEMENTATION" class="id_link">#</a></h3></div></div></div><p>
 444   This section covers B-Tree index implementation details that may be
 445   of use to advanced users.  See
 446   <code class="filename">src/backend/access/nbtree/README</code> in the source
 447   distribution for a much more detailed, internals-focused description
 448   of the B-Tree implementation.
 449  </p><div class="sect3" id="BTREE-STRUCTURE"><div class="titlepage"><div><div><h4 class="title">65.1.4.1. B-Tree Structure <a href="#BTREE-STRUCTURE" class="id_link">#</a></h4></div></div></div><p>
 450    <span class="productname">PostgreSQL</span> B-Tree indexes are
 451    multi-level tree structures, where each level of the tree can be
 452    used as a doubly-linked list of pages.  A single metapage is stored
 453    in a fixed position at the start of the first segment file of the
 454    index.  All other pages are either leaf pages or internal pages.
 455    Leaf pages are the pages on the lowest level of the tree.  All
 456    other levels consist of internal pages.  Each leaf page contains
 457    tuples that point to table rows.  Each internal page contains
 458    tuples that point to the next level down in the tree.  Typically,
 459    over 99% of all pages are leaf pages.  Both internal pages and leaf
 460    pages use the standard page format described in <a class="xref" href="storage-page-layout.html" title="66.6. Database Page Layout">Section 66.6</a>.
 461   </p><p>
 462    New leaf pages are added to a B-Tree index when an existing leaf
 463    page cannot fit an incoming tuple.  A <em class="firstterm">page
 464     split</em> operation makes room for items that originally
 465    belonged on the overflowing page by moving a portion of the items
 466    to a new page.  Page splits must also insert a new
 467    <em class="firstterm">downlink</em> to the new page in the parent page,
 468    which may cause the parent to split in turn.  Page splits
 469    <span class="quote">“<span class="quote">cascade upwards</span>”</span> in a recursive fashion.  When the
 470    root page finally cannot fit a new downlink, a <em class="firstterm">root page
 471     split</em> operation takes place.  This adds a new level to
 472    the tree structure by creating a new root page that is one level
 473    above the original root page.
 474   </p></div><div class="sect3" id="BTREE-DELETION"><div class="titlepage"><div><div><h4 class="title">65.1.4.2. Bottom-up Index Deletion <a href="#BTREE-DELETION" class="id_link">#</a></h4></div></div></div><p>
 475    B-Tree indexes are not directly aware that under MVCC, there might
 476    be multiple extant versions of the same logical table row; to an
 477    index, each tuple is an independent object that needs its own index
 478    entry.  <span class="quote">“<span class="quote">Version churn</span>”</span> tuples may sometimes
 479    accumulate and adversely affect query latency and throughput.  This
 480    typically occurs with <code class="command">UPDATE</code>-heavy workloads
 481    where most individual updates cannot apply the
 482    <a class="link" href="storage-hot.html" title="66.7. Heap-Only Tuples (HOT)"><acronym class="acronym">HOT</acronym> optimization.</a>
 483    Changing the value of only
 484    one column covered by one index during an <code class="command">UPDATE</code>
 485    <span class="emphasis"><em>always</em></span> necessitates a new set of index tuples
 486    — one for <span class="emphasis"><em>each and every</em></span> index on the
 487    table.  Note in particular that this includes indexes that were not
 488    <span class="quote">“<span class="quote">logically modified</span>”</span> by the <code class="command">UPDATE</code>.
 489    All indexes will need a successor physical index tuple that points
 490    to the latest version in the table.  Each new tuple within each
 491    index will generally need to coexist with the original
 492    <span class="quote">“<span class="quote">updated</span>”</span> tuple for a short period of time (typically
 493    until shortly after the <code class="command">UPDATE</code> transaction
 494    commits).
 495   </p><p>
 496    B-Tree indexes incrementally delete version churn index tuples by
 497    performing <em class="firstterm">bottom-up index deletion</em> passes.
 498    Each deletion pass is triggered in reaction to an anticipated
 499    <span class="quote">“<span class="quote">version churn page split</span>”</span>.  This only happens with
 500    indexes that are not logically modified by
 501    <code class="command">UPDATE</code> statements, where concentrated build up
 502    of obsolete versions in particular pages would occur otherwise.  A
 503    page split will usually be avoided, though it's possible that
 504    certain implementation-level heuristics will fail to identify and
 505    delete even one garbage index tuple (in which case a page split or
 506    deduplication pass resolves the issue of an incoming new tuple not
 507    fitting on a leaf page).  The worst-case number of versions that
 508    any index scan must traverse (for any single logical row) is an
 509    important contributor to overall system responsiveness and
 510    throughput.  A bottom-up index deletion pass targets suspected
 511    garbage tuples in a single leaf page based on
 512    <span class="emphasis"><em>qualitative</em></span> distinctions involving logical
 513    rows and versions.  This contrasts with the <span class="quote">“<span class="quote">top-down</span>”</span>
 514    index cleanup performed by autovacuum workers, which is triggered
 515    when certain <span class="emphasis"><em>quantitative</em></span> table-level
 516    thresholds are exceeded (see <a class="xref" href="routine-vacuuming.html#AUTOVACUUM" title="24.1.6. The Autovacuum Daemon">Section 24.1.6</a>).
 517   </p><div class="note"><h3 class="title">Note</h3><p>
 518     Not all deletion operations that are performed within B-Tree
 519     indexes are bottom-up deletion operations.  There is a distinct
 520     category of index tuple deletion: <em class="firstterm">simple index tuple
 521      deletion</em>.  This is a deferred maintenance operation
 522     that deletes index tuples that are known to be safe to delete
 523     (those whose item identifier's <code class="literal">LP_DEAD</code> bit is
 524     already set).  Like bottom-up index deletion, simple index
 525     deletion takes place at the point that a page split is anticipated
 526     as a way of avoiding the split.
 527    </p><p>
 528     Simple deletion is opportunistic in the sense that it can only
 529     take place when recent index scans set the
 530     <code class="literal">LP_DEAD</code> bits of affected items in passing.
 531     Prior to <span class="productname">PostgreSQL</span> 14, the only
 532     category of B-Tree deletion was simple deletion.  The main
 533     differences between it and bottom-up deletion are that only the
 534     former is opportunistically driven by the activity of passing
 535     index scans, while only the latter specifically targets version
 536     churn from <code class="command">UPDATE</code>s that do not logically modify
 537     indexed columns.
 538    </p></div><p>
 539    Bottom-up index deletion performs the vast majority of all garbage
 540    index tuple cleanup for particular indexes with certain workloads.
 541    This is expected with any B-Tree index that is subject to
 542    significant version churn from <code class="command">UPDATE</code>s that
 543    rarely or never logically modify the columns that the index covers.
 544    The average and worst-case number of versions per logical row can
 545    be kept low purely through targeted incremental deletion passes.
 546    It's quite possible that the on-disk size of certain indexes will
 547    never increase by even one single page/block despite
 548    <span class="emphasis"><em>constant</em></span> version churn from
 549    <code class="command">UPDATE</code>s.  Even then, an exhaustive <span class="quote">“<span class="quote">clean
 550     sweep</span>”</span> by a <code class="command">VACUUM</code> operation (typically
 551    run in an autovacuum worker process) will eventually be required as
 552    a part of <span class="emphasis"><em>collective</em></span> cleanup of the table and
 553    each of its indexes.
 554   </p><p>
 555    Unlike <code class="command">VACUUM</code>, bottom-up index deletion does not
 556    provide any strong guarantees about how old the oldest garbage
 557    index tuple may be.  No index can be permitted to retain
 558    <span class="quote">“<span class="quote">floating garbage</span>”</span> index tuples that became dead prior
 559    to a conservative cutoff point shared by the table and all of its
 560    indexes collectively.  This fundamental table-level invariant makes
 561    it safe to recycle table <acronym class="acronym">TID</acronym>s.  This is how it
 562    is possible for distinct logical rows to reuse the same table
 563    <acronym class="acronym">TID</acronym> over time (though this can never happen with
 564    two logical rows whose lifetimes span the same
 565    <code class="command">VACUUM</code> cycle).
 566   </p></div><div class="sect3" id="BTREE-DEDUPLICATION"><div class="titlepage"><div><div><h4 class="title">65.1.4.3. Deduplication <a href="#BTREE-DEDUPLICATION" class="id_link">#</a></h4></div></div></div><p>
 567    A duplicate is a leaf page tuple (a tuple that points to a table
 568    row) where <span class="emphasis"><em>all</em></span> indexed key columns have values
 569    that match corresponding column values from at least one other leaf
 570    page tuple in the same index.  Duplicate tuples are quite common in
 571    practice.  B-Tree indexes can use a special, space-efficient
 572    representation for duplicates when an optional technique is
 573    enabled: <em class="firstterm">deduplication</em>.
 574   </p><p>
 575    Deduplication works by periodically merging groups of duplicate
 576    tuples together, forming a single <em class="firstterm">posting list</em> tuple for each
 577    group.  The column key value(s) only appear once in this
 578    representation.  This is followed by a sorted array of
 579    <acronym class="acronym">TID</acronym>s that point to rows in the table.  This
 580    significantly reduces the storage size of indexes where each value
 581    (or each distinct combination of column values) appears several
 582    times on average.  The latency of queries can be reduced
 583    significantly.  Overall query throughput may increase
 584    significantly.  The overhead of routine index vacuuming may also be
 585    reduced significantly.
 586   </p><div class="note"><h3 class="title">Note</h3><p>
 587     B-Tree deduplication is just as effective with
 588     <span class="quote">“<span class="quote">duplicates</span>”</span> that contain a NULL value, even though
 589     NULL values are never equal to each other according to the
 590     <code class="literal">=</code> member of any B-Tree operator class.  As far
 591     as any part of the implementation that understands the on-disk
 592     B-Tree structure is concerned, NULL is just another value from the
 593     domain of indexed values.
 594    </p></div><p>
 595    The deduplication process occurs lazily, when a new item is
 596    inserted that cannot fit on an existing leaf page, though only when
 597    index tuple deletion could not free sufficient space for the new
 598    item (typically deletion is briefly considered and then skipped
 599    over).  Unlike GIN posting list tuples, B-Tree posting list tuples
 600    do not need to expand every time a new duplicate is inserted; they
 601    are merely an alternative physical representation of the original
 602    logical contents of the leaf page.  This design prioritizes
 603    consistent performance with mixed read-write workloads.  Most
 604    client applications will at least see a moderate performance
 605    benefit from using deduplication.  Deduplication is enabled by
 606    default.
 607   </p><p>
 608    <code class="command">CREATE INDEX</code> and <code class="command">REINDEX</code>
 609    apply deduplication to create posting list tuples, though the
 610    strategy they use is slightly different.  Each group of duplicate
 611    ordinary tuples encountered in the sorted input taken from the
 612    table is merged into a posting list tuple
 613    <span class="emphasis"><em>before</em></span> being added to the current pending leaf
 614    page.  Individual posting list tuples are packed with as many
 615    <acronym class="acronym">TID</acronym>s as possible.  Leaf pages are written out in
 616    the usual way, without any separate deduplication pass.  This
 617    strategy is well-suited to <code class="command">CREATE INDEX</code> and
 618    <code class="command">REINDEX</code> because they are once-off batch
 619    operations.
 620   </p><p>
 621    Write-heavy workloads that don't benefit from deduplication due to
 622    having few or no duplicate values in indexes will incur a small,
 623    fixed performance penalty (unless deduplication is explicitly
 624    disabled).  The <code class="literal">deduplicate_items</code> storage
 625    parameter can be used to disable deduplication within individual
 626    indexes.  There is never any performance penalty with read-only
 627    workloads, since reading posting list tuples is at least as
 628    efficient as reading the standard tuple representation.  Disabling
 629    deduplication isn't usually helpful.
 630   </p><p>
 631    It is sometimes possible for unique indexes (as well as unique
 632    constraints) to use deduplication.  This allows leaf pages to
 633    temporarily <span class="quote">“<span class="quote">absorb</span>”</span> extra version churn duplicates.
 634    Deduplication in unique indexes augments bottom-up index deletion,
 635    especially in cases where a long-running transaction holds a
 636    snapshot that blocks garbage collection.  The goal is to buy time
 637    for the bottom-up index deletion strategy to become effective
 638    again.  Delaying page splits until a single long-running
 639    transaction naturally goes away can allow a bottom-up deletion pass
 640    to succeed where an earlier deletion pass failed.
 641   </p><div class="tip"><h3 class="title">Tip</h3><p>
 642     A special heuristic is applied to determine whether a
 643     deduplication pass in a unique index should take place.  It can
 644     often skip straight to splitting a leaf page, avoiding a
 645     performance penalty from wasting cycles on unhelpful deduplication
 646     passes.  If you're concerned about the overhead of deduplication,
 647     consider setting <code class="literal">deduplicate_items = off</code>
 648     selectively.  Leaving deduplication enabled in unique indexes has
 649     little downside.
 650    </p></div><p>
 651    Deduplication cannot be used in all cases due to
 652    implementation-level restrictions.  Deduplication safety is
 653    determined when <code class="command">CREATE INDEX</code> or
 654    <code class="command">REINDEX</code> is run.
 655   </p><p>
 656    Note that deduplication is deemed unsafe and cannot be used in the
 657    following cases involving semantically significant differences
 658    among equal datums:
 659   </p><p>
 660    </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
 661       <code class="type">text</code>, <code class="type">varchar</code>, and <code class="type">char</code>
 662       cannot use deduplication when a
 663       <span class="emphasis"><em>nondeterministic</em></span> collation is used.  Case
 664       and accent differences must be preserved among equal datums.
 665      </p></li><li class="listitem"><p>
 666       <code class="type">numeric</code> cannot use deduplication.  Numeric display
 667       scale must be preserved among equal datums.
 668      </p></li><li class="listitem"><p>
 669       <code class="type">jsonb</code> cannot use deduplication, since the
 670       <code class="type">jsonb</code> B-Tree operator class uses
 671       <code class="type">numeric</code> internally.
 672      </p></li><li class="listitem"><p>
 673       <code class="type">float4</code> and <code class="type">float8</code> cannot use
 674       deduplication.  These types have distinct representations for
 675       <code class="literal">-0</code> and <code class="literal">0</code>, which are
 676       nevertheless considered equal.  This difference must be
 677       preserved.
 678      </p></li></ul></div><p>
 679   </p><p>
 680    There is one further implementation-level restriction that may be
 681    lifted in a future version of
 682    <span class="productname">PostgreSQL</span>:
 683   </p><p>
 684    </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
 685       Container types (such as composite types, arrays, or range
 686       types) cannot use deduplication.
 687      </p></li></ul></div><p>
 688   </p><p>
 689    There is one further implementation-level restriction that applies
 690    regardless of the operator class or collation used:
 691   </p><p>
 692    </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
 693       <code class="literal">INCLUDE</code> indexes can never use deduplication.
 694      </p></li></ul></div><p>
 695   </p></div></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="indextypes.html" title="Chapter 65. Built-in Index Access Methods">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="indextypes.html" title="Chapter 65. Built-in Index Access Methods">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="gist.html" title="65.2. GiST Indexes">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 65. Built-in Index Access Methods </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 65.2. GiST Indexes</td></tr></table></div></body></html>