1 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>65.1. B-Tree Indexes</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="indextypes.html" title="Chapter 65. Built-in Index Access Methods" /><link rel="next" href="gist.html" title="65.2. GiST Indexes" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">65.1. B-Tree Indexes</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="indextypes.html" title="Chapter 65. Built-in Index Access Methods">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="indextypes.html" title="Chapter 65. Built-in Index Access Methods">Up</a></td><th width="60%" align="center">Chapter 65. Built-in Index Access Methods</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="gist.html" title="65.2. GiST Indexes">Next</a></td></tr></table><hr /></div><div class="sect1" id="BTREE"><div class="titlepage"><div><div><h2 class="title" style="clear: both">65.1. B-Tree Indexes <a href="#BTREE" class="id_link">#</a></h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="btree.html#BTREE-INTRO">65.1.1. Introduction</a></span></dt><dt><span class="sect2"><a href="btree.html#BTREE-BEHAVIOR">65.1.2. Behavior of B-Tree Operator Classes</a></span></dt><dt><span class="sect2"><a href="btree.html#BTREE-SUPPORT-FUNCS">65.1.3. B-Tree Support Functions</a></span></dt><dt><span class="sect2"><a href="btree.html#BTREE-IMPLEMENTATION">65.1.4. Implementation</a></span></dt></dl></div><a id="id-1.10.17.2.2" class="indexterm"></a><div class="sect2" id="BTREE-INTRO"><div class="titlepage"><div><div><h3 class="title">65.1.1. Introduction <a href="#BTREE-INTRO" class="id_link">#</a></h3></div></div></div><p>
3 <span class="productname">PostgreSQL</span> includes an implementation of the
4 standard <acronym class="acronym">btree</acronym> (multi-way balanced tree) index data
5 structure. Any data type that can be sorted into a well-defined linear
6 order can be indexed by a btree index. The only limitation is that an
7 index entry cannot exceed approximately one-third of a page (after TOAST
8 compression, if applicable).
10 Because each btree operator class imposes a sort order on its data type,
11 btree operator classes (or, really, operator families) have come to be
12 used as <span class="productname">PostgreSQL</span>'s general representation
13 and understanding of sorting semantics. Therefore, they've acquired
14 some features that go beyond what would be needed just to support btree
15 indexes, and parts of the system that are quite distant from the
16 btree <acronym class="acronym">AM</acronym> make use of them.
17 </p></div><div class="sect2" id="BTREE-BEHAVIOR"><div class="titlepage"><div><div><h3 class="title">65.1.2. Behavior of B-Tree Operator Classes <a href="#BTREE-BEHAVIOR" class="id_link">#</a></h3></div></div></div><p>
18 As shown in <a class="xref" href="xindex.html#XINDEX-BTREE-STRAT-TABLE" title="Table 36.3. B-Tree Strategies">Table 36.3</a>, a btree operator
19 class must provide five comparison operators,
20 <code class="literal"><</code>,
21 <code class="literal"><=</code>,
22 <code class="literal">=</code>,
23 <code class="literal">>=</code> and
24 <code class="literal">></code>.
25 One might expect that <code class="literal"><></code> should also be part of
26 the operator class, but it is not, because it would almost never be
27 useful to use a <code class="literal"><></code> WHERE clause in an index
28 search. (For some purposes, the planner treats <code class="literal"><></code>
29 as associated with a btree operator class; but it finds that operator via
30 the <code class="literal">=</code> operator's negator link, rather than
31 from <code class="structname">pg_amop</code>.)
33 When several data types share near-identical sorting semantics, their
34 operator classes can be grouped into an operator family. Doing so is
35 advantageous because it allows the planner to make deductions about
36 cross-type comparisons. Each operator class within the family should
37 contain the single-type operators (and associated support functions)
38 for its input data type, while cross-type comparison operators and
39 support functions are <span class="quote">“<span class="quote">loose</span>”</span> in the family. It is
40 recommendable that a complete set of cross-type operators be included
41 in the family, thus ensuring that the planner can represent any
42 comparison conditions that it deduces from transitivity.
44 There are some basic assumptions that a btree operator family must
46 </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
47 An <code class="literal">=</code> operator must be an equivalence relation; that
48 is, for all non-null values <em class="replaceable"><code>A</code></em>,
49 <em class="replaceable"><code>B</code></em>, <em class="replaceable"><code>C</code></em> of the
52 </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; "><li class="listitem"><p>
53 <em class="replaceable"><code>A</code></em> <code class="literal">=</code>
54 <em class="replaceable"><code>A</code></em> is true
55 (<em class="firstterm">reflexive law</em>)
56 </p></li><li class="listitem"><p>
57 if <em class="replaceable"><code>A</code></em> <code class="literal">=</code>
58 <em class="replaceable"><code>B</code></em>,
59 then <em class="replaceable"><code>B</code></em> <code class="literal">=</code>
60 <em class="replaceable"><code>A</code></em>
61 (<em class="firstterm">symmetric law</em>)
62 </p></li><li class="listitem"><p>
63 if <em class="replaceable"><code>A</code></em> <code class="literal">=</code>
64 <em class="replaceable"><code>B</code></em> and <em class="replaceable"><code>B</code></em>
65 <code class="literal">=</code> <em class="replaceable"><code>C</code></em>,
66 then <em class="replaceable"><code>A</code></em> <code class="literal">=</code>
67 <em class="replaceable"><code>C</code></em>
68 (<em class="firstterm">transitive law</em>)
69 </p></li></ul></div><p>
70 </p></li><li class="listitem"><p>
71 A <code class="literal"><</code> operator must be a strong ordering relation;
72 that is, for all non-null values <em class="replaceable"><code>A</code></em>,
73 <em class="replaceable"><code>B</code></em>, <em class="replaceable"><code>C</code></em>:
75 </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; "><li class="listitem"><p>
76 <em class="replaceable"><code>A</code></em> <code class="literal"><</code>
77 <em class="replaceable"><code>A</code></em> is false
78 (<em class="firstterm">irreflexive law</em>)
79 </p></li><li class="listitem"><p>
80 if <em class="replaceable"><code>A</code></em> <code class="literal"><</code>
81 <em class="replaceable"><code>B</code></em>
82 and <em class="replaceable"><code>B</code></em> <code class="literal"><</code>
83 <em class="replaceable"><code>C</code></em>,
84 then <em class="replaceable"><code>A</code></em> <code class="literal"><</code>
85 <em class="replaceable"><code>C</code></em>
86 (<em class="firstterm">transitive law</em>)
87 </p></li></ul></div><p>
88 </p></li><li class="listitem"><p>
89 Furthermore, the ordering is total; that is, for all non-null
90 values <em class="replaceable"><code>A</code></em>, <em class="replaceable"><code>B</code></em>:
92 </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; "><li class="listitem"><p>
93 exactly one of <em class="replaceable"><code>A</code></em> <code class="literal"><</code>
94 <em class="replaceable"><code>B</code></em>, <em class="replaceable"><code>A</code></em>
95 <code class="literal">=</code> <em class="replaceable"><code>B</code></em>, and
96 <em class="replaceable"><code>B</code></em> <code class="literal"><</code>
97 <em class="replaceable"><code>A</code></em> is true
98 (<em class="firstterm">trichotomy law</em>)
99 </p></li></ul></div><p>
101 (The trichotomy law justifies the definition of the comparison support
102 function, of course.)
103 </p></li></ul></div><p>
104 The other three operators are defined in terms of <code class="literal">=</code>
105 and <code class="literal"><</code> in the obvious way, and must act consistently
108 For an operator family supporting multiple data types, the above laws must
109 hold when <em class="replaceable"><code>A</code></em>, <em class="replaceable"><code>B</code></em>,
110 <em class="replaceable"><code>C</code></em> are taken from any data types in the family.
111 The transitive laws are the trickiest to ensure, as in cross-type
112 situations they represent statements that the behaviors of two or three
113 different operators are consistent.
114 As an example, it would not work to put <code class="type">float8</code>
115 and <code class="type">numeric</code> into the same operator family, at least not with
116 the current semantics that <code class="type">numeric</code> values are converted
117 to <code class="type">float8</code> for comparison to a <code class="type">float8</code>. Because
118 of the limited accuracy of <code class="type">float8</code>, this means there are
119 distinct <code class="type">numeric</code> values that will compare equal to the
120 same <code class="type">float8</code> value, and thus the transitive law would fail.
122 Another requirement for a multiple-data-type family is that any implicit
123 or binary-coercion casts that are defined between data types included in
124 the operator family must not change the associated sort ordering.
126 It should be fairly clear why a btree index requires these laws to hold
127 within a single data type: without them there is no ordering to arrange
128 the keys with. Also, index searches using a comparison key of a
129 different data type require comparisons to behave sanely across two
130 data types. The extensions to three or more data types within a family
131 are not strictly required by the btree index mechanism itself, but the
132 planner relies on them for optimization purposes.
133 </p></div><div class="sect2" id="BTREE-SUPPORT-FUNCS"><div class="titlepage"><div><div><h3 class="title">65.1.3. B-Tree Support Functions <a href="#BTREE-SUPPORT-FUNCS" class="id_link">#</a></h3></div></div></div><p>
134 As shown in <a class="xref" href="xindex.html#XINDEX-BTREE-SUPPORT-TABLE" title="Table 36.9. B-Tree Support Functions">Table 36.9</a>, btree defines
135 one required and five optional support functions. The six
136 user-defined methods are:
137 </p><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="function">order</code></span></dt><dd><p>
138 For each combination of data types that a btree operator family
139 provides comparison operators for, it must provide a comparison
140 support function, registered in
141 <code class="structname">pg_amproc</code> with support function number 1
143 <code class="structfield">amproclefttype</code>/<code class="structfield">amprocrighttype</code>
144 equal to the left and right data types for the comparison (i.e.,
145 the same data types that the matching operators are registered
146 with in <code class="structname">pg_amop</code>). The comparison
147 function must take two non-null values
148 <em class="replaceable"><code>A</code></em> and <em class="replaceable"><code>B</code></em> and
149 return an <code class="type">int32</code> value that is
150 <code class="literal"><</code> <code class="literal">0</code>,
151 <code class="literal">0</code>, or <code class="literal">></code>
152 <code class="literal">0</code> when <em class="replaceable"><code>A</code></em>
153 <code class="literal"><</code> <em class="replaceable"><code>B</code></em>,
154 <em class="replaceable"><code>A</code></em> <code class="literal">=</code>
155 <em class="replaceable"><code>B</code></em>, or <em class="replaceable"><code>A</code></em>
156 <code class="literal">></code> <em class="replaceable"><code>B</code></em>,
157 respectively. A null result is disallowed: all values of the
158 data type must be comparable. See
159 <code class="filename">src/backend/access/nbtree/nbtcompare.c</code> for
162 If the compared values are of a collatable data type, the
163 appropriate collation OID will be passed to the comparison
164 support function, using the standard
165 <code class="function">PG_GET_COLLATION()</code> mechanism.
166 </p></dd><dt><span class="term"><code class="function">sortsupport</code></span></dt><dd><p>
167 Optionally, a btree operator family may provide <em class="firstterm">sort
168 support</em> function(s), registered under support
169 function number 2. These functions allow implementing
170 comparisons for sorting purposes in a more efficient way than
171 naively calling the comparison support function. The APIs
172 involved in this are defined in
173 <code class="filename">src/include/utils/sortsupport.h</code>.
174 </p></dd><dt><span class="term"><code class="function">in_range</code></span></dt><dd><a id="id-1.10.17.2.5.3.3.2.1" class="indexterm"></a><a id="id-1.10.17.2.5.3.3.2.2" class="indexterm"></a><p>
175 Optionally, a btree operator family may provide
176 <em class="firstterm">in_range</em> support function(s), registered
177 under support function number 3. These are not used during btree
178 index operations; rather, they extend the semantics of the
179 operator family so that it can support window clauses containing
180 the <code class="literal">RANGE</code> <em class="replaceable"><code>offset</code></em>
181 <code class="literal">PRECEDING</code> and <code class="literal">RANGE</code>
182 <em class="replaceable"><code>offset</code></em> <code class="literal">FOLLOWING</code>
183 frame bound types (see <a class="xref" href="sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS" title="4.2.8. Window Function Calls">Section 4.2.8</a>). Fundamentally, the extra
184 information provided is how to add or subtract an
185 <em class="replaceable"><code>offset</code></em> value in a way that is
186 compatible with the family's data ordering.
188 An <code class="function">in_range</code> function must have the signature
189 </p><pre class="synopsis">
190 in_range(<em class="replaceable"><code>val</code></em> type1, <em class="replaceable"><code>base</code></em> type1, <em class="replaceable"><code>offset</code></em> type2, <em class="replaceable"><code>sub</code></em> bool, <em class="replaceable"><code>less</code></em> bool)
193 <em class="replaceable"><code>val</code></em> and
194 <em class="replaceable"><code>base</code></em> must be of the same type, which
195 is one of the types supported by the operator family (i.e., a
196 type for which it provides an ordering). However,
197 <em class="replaceable"><code>offset</code></em> could be of a different type,
198 which might be one otherwise unsupported by the family. An
199 example is that the built-in <code class="literal">time_ops</code> family
200 provides an <code class="function">in_range</code> function that has
201 <em class="replaceable"><code>offset</code></em> of type <code class="type">interval</code>.
202 A family can provide <code class="function">in_range</code> functions for
203 any of its supported types and one or more
204 <em class="replaceable"><code>offset</code></em> types. Each
205 <code class="function">in_range</code> function should be entered in
206 <code class="structname">pg_amproc</code> with
207 <code class="structfield">amproclefttype</code> equal to
208 <code class="type">type1</code> and <code class="structfield">amprocrighttype</code>
209 equal to <code class="type">type2</code>.
211 The essential semantics of an <code class="function">in_range</code>
212 function depend on the two Boolean flag parameters. It should
213 add or subtract <em class="replaceable"><code>base</code></em> and
214 <em class="replaceable"><code>offset</code></em>, then compare
215 <em class="replaceable"><code>val</code></em> to the result, as follows:
216 </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
217 if <code class="literal">!</code><em class="replaceable"><code>sub</code></em> and
218 <code class="literal">!</code><em class="replaceable"><code>less</code></em>, return
219 <em class="replaceable"><code>val</code></em> <code class="literal">>=</code>
220 (<em class="replaceable"><code>base</code></em> <code class="literal">+</code>
221 <em class="replaceable"><code>offset</code></em>)
222 </p></li><li class="listitem"><p>
223 if <code class="literal">!</code><em class="replaceable"><code>sub</code></em> and
224 <em class="replaceable"><code>less</code></em>, return
225 <em class="replaceable"><code>val</code></em> <code class="literal"><=</code>
226 (<em class="replaceable"><code>base</code></em> <code class="literal">+</code>
227 <em class="replaceable"><code>offset</code></em>)
228 </p></li><li class="listitem"><p>
229 if <em class="replaceable"><code>sub</code></em> and
230 <code class="literal">!</code><em class="replaceable"><code>less</code></em>, return
231 <em class="replaceable"><code>val</code></em> <code class="literal">>=</code>
232 (<em class="replaceable"><code>base</code></em> <code class="literal">-</code>
233 <em class="replaceable"><code>offset</code></em>)
234 </p></li><li class="listitem"><p>
235 if <em class="replaceable"><code>sub</code></em> and
236 <em class="replaceable"><code>less</code></em>, return
237 <em class="replaceable"><code>val</code></em> <code class="literal"><=</code>
238 (<em class="replaceable"><code>base</code></em> <code class="literal">-</code>
239 <em class="replaceable"><code>offset</code></em>)
240 </p></li></ul></div><p>
241 Before doing so, the function should check the sign of
242 <em class="replaceable"><code>offset</code></em>: if it is less than zero, raise
244 <code class="literal">ERRCODE_INVALID_PRECEDING_OR_FOLLOWING_SIZE</code>
245 (22013) with error text like <span class="quote">“<span class="quote">invalid preceding or
246 following size in window function</span>”</span>. (This is required by
247 the SQL standard, although nonstandard operator families might
248 perhaps choose to ignore this restriction, since there seems to
249 be little semantic necessity for it.) This requirement is
250 delegated to the <code class="function">in_range</code> function so that
251 the core code needn't understand what <span class="quote">“<span class="quote">less than
252 zero</span>”</span> means for a particular data type.
254 An additional expectation is that <code class="function">in_range</code>
255 functions should, if practical, avoid throwing an error if
256 <em class="replaceable"><code>base</code></em> <code class="literal">+</code>
257 <em class="replaceable"><code>offset</code></em> or
258 <em class="replaceable"><code>base</code></em> <code class="literal">-</code>
259 <em class="replaceable"><code>offset</code></em> would overflow. The correct
260 comparison result can be determined even if that value would be
261 out of the data type's range. Note that if the data type
262 includes concepts such as <span class="quote">“<span class="quote">infinity</span>”</span> or
263 <span class="quote">“<span class="quote">NaN</span>”</span>, extra care may be needed to ensure that
264 <code class="function">in_range</code>'s results agree with the normal
265 sort order of the operator family.
267 The results of the <code class="function">in_range</code> function must be
268 consistent with the sort ordering imposed by the operator family.
269 To be precise, given any fixed values of
270 <em class="replaceable"><code>offset</code></em> and
271 <em class="replaceable"><code>sub</code></em>, then:
272 </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
273 If <code class="function">in_range</code> with
274 <em class="replaceable"><code>less</code></em> = true is true for some
275 <em class="replaceable"><code>val1</code></em> and
276 <em class="replaceable"><code>base</code></em>, it must be true for every
277 <em class="replaceable"><code>val2</code></em> <code class="literal"><=</code>
278 <em class="replaceable"><code>val1</code></em> with the same
279 <em class="replaceable"><code>base</code></em>.
280 </p></li><li class="listitem"><p>
281 If <code class="function">in_range</code> with
282 <em class="replaceable"><code>less</code></em> = true is false for some
283 <em class="replaceable"><code>val1</code></em> and
284 <em class="replaceable"><code>base</code></em>, it must be false for every
285 <em class="replaceable"><code>val2</code></em> <code class="literal">>=</code>
286 <em class="replaceable"><code>val1</code></em> with the same
287 <em class="replaceable"><code>base</code></em>.
288 </p></li><li class="listitem"><p>
289 If <code class="function">in_range</code> with
290 <em class="replaceable"><code>less</code></em> = true is true for some
291 <em class="replaceable"><code>val</code></em> and
292 <em class="replaceable"><code>base1</code></em>, it must be true for every
293 <em class="replaceable"><code>base2</code></em> <code class="literal">>=</code>
294 <em class="replaceable"><code>base1</code></em> with the same
295 <em class="replaceable"><code>val</code></em>.
296 </p></li><li class="listitem"><p>
297 If <code class="function">in_range</code> with
298 <em class="replaceable"><code>less</code></em> = true is false for some
299 <em class="replaceable"><code>val</code></em> and
300 <em class="replaceable"><code>base1</code></em>, it must be false for every
301 <em class="replaceable"><code>base2</code></em> <code class="literal"><=</code>
302 <em class="replaceable"><code>base1</code></em> with the same
303 <em class="replaceable"><code>val</code></em>.
304 </p></li></ul></div><p>
305 Analogous statements with inverted conditions hold when
306 <em class="replaceable"><code>less</code></em> = false.
308 If the type being ordered (<code class="type">type1</code>) is collatable, the
309 appropriate collation OID will be passed to the
310 <code class="function">in_range</code> function, using the standard
311 PG_GET_COLLATION() mechanism.
313 <code class="function">in_range</code> functions need not handle NULL
314 inputs, and typically will be marked strict.
315 </p></dd><dt><span class="term"><code class="function">equalimage</code></span></dt><dd><p>
316 Optionally, a btree operator family may provide
317 <code class="function">equalimage</code> (<span class="quote">“<span class="quote">equality implies image
318 equality</span>”</span>) support functions, registered under support
319 function number 4. These functions allow the core code to
320 determine when it is safe to apply the btree deduplication
321 optimization. Currently, <code class="function">equalimage</code>
322 functions are only called when building or rebuilding an index.
324 An <code class="function">equalimage</code> function must have the
326 </p><pre class="synopsis">
327 equalimage(<em class="replaceable"><code>opcintype</code></em> <code class="type">oid</code>) returns bool
329 The return value is static information about an operator class
330 and collation. Returning <code class="literal">true</code> indicates that
331 the <code class="function">order</code> function for the operator class is
332 guaranteed to only return <code class="literal">0</code> (<span class="quote">“<span class="quote">arguments
333 are equal</span>”</span>) when its <em class="replaceable"><code>A</code></em> and
334 <em class="replaceable"><code>B</code></em> arguments are also interchangeable
335 without any loss of semantic information. Not registering an
336 <code class="function">equalimage</code> function or returning
337 <code class="literal">false</code> indicates that this condition cannot be
340 The <em class="replaceable"><code>opcintype</code></em> argument is the
341 <code class="literal"><code class="structname">pg_type</code>.oid</code> of the
342 data type that the operator class indexes. This is a convenience
343 that allows reuse of the same underlying
344 <code class="function">equalimage</code> function across operator classes.
345 If <em class="replaceable"><code>opcintype</code></em> is a collatable data
346 type, the appropriate collation OID will be passed to the
347 <code class="function">equalimage</code> function, using the standard
348 <code class="function">PG_GET_COLLATION()</code> mechanism.
350 As far as the operator class is concerned, returning
351 <code class="literal">true</code> indicates that deduplication is safe (or
352 safe for the collation whose OID was passed to its
353 <code class="function">equalimage</code> function). However, the core
354 code will only deem deduplication safe for an index when
355 <span class="emphasis"><em>every</em></span> indexed column uses an operator class
356 that registers an <code class="function">equalimage</code> function, and
357 each function actually returns <code class="literal">true</code> when
360 Image equality is <span class="emphasis"><em>almost</em></span> the same condition
361 as simple bitwise equality. There is one subtle difference: When
362 indexing a varlena data type, the on-disk representation of two
363 image equal datums may not be bitwise equal due to inconsistent
364 application of <acronym class="acronym">TOAST</acronym> compression on input.
365 Formally, when an operator class's
366 <code class="function">equalimage</code> function returns
367 <code class="literal">true</code>, it is safe to assume that the
368 <code class="literal">datum_image_eq()</code> C function will always agree
369 with the operator class's <code class="function">order</code> function
370 (provided that the same collation OID is passed to both the
371 <code class="function">equalimage</code> and <code class="function">order</code>
374 The core code is fundamentally unable to deduce anything about
375 the <span class="quote">“<span class="quote">equality implies image equality</span>”</span> status of an
376 operator class within a multiple-data-type family based on
377 details from other operator classes in the same family. Also, it
378 is not sensible for an operator family to register a cross-type
379 <code class="function">equalimage</code> function, and attempting to do so
380 will result in an error. This is because <span class="quote">“<span class="quote">equality implies
381 image equality</span>”</span> status does not just depend on
382 sorting/equality semantics, which are more or less defined at the
383 operator family level. In general, the semantics that one
384 particular data type implements must be considered separately.
386 The convention followed by the operator classes included with the
387 core <span class="productname">PostgreSQL</span> distribution is to
388 register a stock, generic <code class="function">equalimage</code>
389 function. Most operator classes register
390 <code class="function">btequalimage()</code>, which indicates that
391 deduplication is safe unconditionally. Operator classes for
392 collatable data types such as <code class="type">text</code> register
393 <code class="function">btvarstrequalimage()</code>, which indicates that
394 deduplication is safe with deterministic collations. Best
395 practice for third-party extensions is to register their own
396 custom function to retain control.
397 </p></dd><dt><span class="term"><code class="function">options</code></span></dt><dd><p>
398 Optionally, a B-tree operator family may provide
399 <code class="function">options</code> (<span class="quote">“<span class="quote">operator class specific
400 options</span>”</span>) support functions, registered under support
401 function number 5. These functions define a set of user-visible
402 parameters that control operator class behavior.
404 An <code class="function">options</code> support function must have the
406 </p><pre class="synopsis">
407 options(<em class="replaceable"><code>relopts</code></em> <code class="type">local_relopts *</code>) returns void
409 The function is passed a pointer to a <code class="structname">local_relopts</code>
410 struct, which needs to be filled with a set of operator class
411 specific options. The options can be accessed from other support
412 functions using the <code class="literal">PG_HAS_OPCLASS_OPTIONS()</code> and
413 <code class="literal">PG_GET_OPCLASS_OPTIONS()</code> macros.
415 Currently, no B-Tree operator class has an <code class="function">options</code>
416 support function. B-tree doesn't allow flexible representation of keys
417 like GiST, SP-GiST, GIN and BRIN do. So, <code class="function">options</code>
418 probably doesn't have much application in the current B-tree index
419 access method. Nevertheless, this support function was added to B-tree
420 for uniformity, and will probably find uses during further
421 evolution of B-tree in <span class="productname">PostgreSQL</span>.
422 </p></dd><dt><span class="term"><code class="function">skipsupport</code></span></dt><dd><p>
423 Optionally, a btree operator family may provide a <em class="firstterm">skip
424 support</em> function, registered under support function number 6.
425 These functions give the B-tree code a way to iterate through every
426 possible value that can be represented by an operator class's underlying
427 input type, in key space order. This is used by the core code when it
428 applies the skip scan optimization. The APIs involved in this are
429 defined in <code class="filename">src/include/utils/skipsupport.h</code>.
431 Operator classes that do not provide a skip support function are still
432 eligible to use skip scan. The core code can still use its fallback
433 strategy, though that might be suboptimal for some discrete types. It
434 usually doesn't make sense (and may not even be feasible) for operator
435 classes on continuous types to provide a skip support function.
437 It is not sensible for an operator family to register a cross-type
438 <code class="function">skipsupport</code> function, and attempting to do so will
439 result in an error. This is because determining the next indexable value
440 must happen by incrementing a value copied from an index tuple. The
441 values generated must all be of the same underlying data type (the
442 <span class="quote">“<span class="quote">skipped</span>”</span> index column's opclass input type).
443 </p></dd></dl></div></div><div class="sect2" id="BTREE-IMPLEMENTATION"><div class="titlepage"><div><div><h3 class="title">65.1.4. Implementation <a href="#BTREE-IMPLEMENTATION" class="id_link">#</a></h3></div></div></div><p>
444 This section covers B-Tree index implementation details that may be
445 of use to advanced users. See
446 <code class="filename">src/backend/access/nbtree/README</code> in the source
447 distribution for a much more detailed, internals-focused description
448 of the B-Tree implementation.
449 </p><div class="sect3" id="BTREE-STRUCTURE"><div class="titlepage"><div><div><h4 class="title">65.1.4.1. B-Tree Structure <a href="#BTREE-STRUCTURE" class="id_link">#</a></h4></div></div></div><p>
450 <span class="productname">PostgreSQL</span> B-Tree indexes are
451 multi-level tree structures, where each level of the tree can be
452 used as a doubly-linked list of pages. A single metapage is stored
453 in a fixed position at the start of the first segment file of the
454 index. All other pages are either leaf pages or internal pages.
455 Leaf pages are the pages on the lowest level of the tree. All
456 other levels consist of internal pages. Each leaf page contains
457 tuples that point to table rows. Each internal page contains
458 tuples that point to the next level down in the tree. Typically,
459 over 99% of all pages are leaf pages. Both internal pages and leaf
460 pages use the standard page format described in <a class="xref" href="storage-page-layout.html" title="66.6. Database Page Layout">Section 66.6</a>.
462 New leaf pages are added to a B-Tree index when an existing leaf
463 page cannot fit an incoming tuple. A <em class="firstterm">page
464 split</em> operation makes room for items that originally
465 belonged on the overflowing page by moving a portion of the items
466 to a new page. Page splits must also insert a new
467 <em class="firstterm">downlink</em> to the new page in the parent page,
468 which may cause the parent to split in turn. Page splits
469 <span class="quote">“<span class="quote">cascade upwards</span>”</span> in a recursive fashion. When the
470 root page finally cannot fit a new downlink, a <em class="firstterm">root page
471 split</em> operation takes place. This adds a new level to
472 the tree structure by creating a new root page that is one level
473 above the original root page.
474 </p></div><div class="sect3" id="BTREE-DELETION"><div class="titlepage"><div><div><h4 class="title">65.1.4.2. Bottom-up Index Deletion <a href="#BTREE-DELETION" class="id_link">#</a></h4></div></div></div><p>
475 B-Tree indexes are not directly aware that under MVCC, there might
476 be multiple extant versions of the same logical table row; to an
477 index, each tuple is an independent object that needs its own index
478 entry. <span class="quote">“<span class="quote">Version churn</span>”</span> tuples may sometimes
479 accumulate and adversely affect query latency and throughput. This
480 typically occurs with <code class="command">UPDATE</code>-heavy workloads
481 where most individual updates cannot apply the
482 <a class="link" href="storage-hot.html" title="66.7. Heap-Only Tuples (HOT)"><acronym class="acronym">HOT</acronym> optimization.</a>
483 Changing the value of only
484 one column covered by one index during an <code class="command">UPDATE</code>
485 <span class="emphasis"><em>always</em></span> necessitates a new set of index tuples
486 — one for <span class="emphasis"><em>each and every</em></span> index on the
487 table. Note in particular that this includes indexes that were not
488 <span class="quote">“<span class="quote">logically modified</span>”</span> by the <code class="command">UPDATE</code>.
489 All indexes will need a successor physical index tuple that points
490 to the latest version in the table. Each new tuple within each
491 index will generally need to coexist with the original
492 <span class="quote">“<span class="quote">updated</span>”</span> tuple for a short period of time (typically
493 until shortly after the <code class="command">UPDATE</code> transaction
496 B-Tree indexes incrementally delete version churn index tuples by
497 performing <em class="firstterm">bottom-up index deletion</em> passes.
498 Each deletion pass is triggered in reaction to an anticipated
499 <span class="quote">“<span class="quote">version churn page split</span>”</span>. This only happens with
500 indexes that are not logically modified by
501 <code class="command">UPDATE</code> statements, where concentrated build up
502 of obsolete versions in particular pages would occur otherwise. A
503 page split will usually be avoided, though it's possible that
504 certain implementation-level heuristics will fail to identify and
505 delete even one garbage index tuple (in which case a page split or
506 deduplication pass resolves the issue of an incoming new tuple not
507 fitting on a leaf page). The worst-case number of versions that
508 any index scan must traverse (for any single logical row) is an
509 important contributor to overall system responsiveness and
510 throughput. A bottom-up index deletion pass targets suspected
511 garbage tuples in a single leaf page based on
512 <span class="emphasis"><em>qualitative</em></span> distinctions involving logical
513 rows and versions. This contrasts with the <span class="quote">“<span class="quote">top-down</span>”</span>
514 index cleanup performed by autovacuum workers, which is triggered
515 when certain <span class="emphasis"><em>quantitative</em></span> table-level
516 thresholds are exceeded (see <a class="xref" href="routine-vacuuming.html#AUTOVACUUM" title="24.1.6. The Autovacuum Daemon">Section 24.1.6</a>).
517 </p><div class="note"><h3 class="title">Note</h3><p>
518 Not all deletion operations that are performed within B-Tree
519 indexes are bottom-up deletion operations. There is a distinct
520 category of index tuple deletion: <em class="firstterm">simple index tuple
521 deletion</em>. This is a deferred maintenance operation
522 that deletes index tuples that are known to be safe to delete
523 (those whose item identifier's <code class="literal">LP_DEAD</code> bit is
524 already set). Like bottom-up index deletion, simple index
525 deletion takes place at the point that a page split is anticipated
526 as a way of avoiding the split.
528 Simple deletion is opportunistic in the sense that it can only
529 take place when recent index scans set the
530 <code class="literal">LP_DEAD</code> bits of affected items in passing.
531 Prior to <span class="productname">PostgreSQL</span> 14, the only
532 category of B-Tree deletion was simple deletion. The main
533 differences between it and bottom-up deletion are that only the
534 former is opportunistically driven by the activity of passing
535 index scans, while only the latter specifically targets version
536 churn from <code class="command">UPDATE</code>s that do not logically modify
539 Bottom-up index deletion performs the vast majority of all garbage
540 index tuple cleanup for particular indexes with certain workloads.
541 This is expected with any B-Tree index that is subject to
542 significant version churn from <code class="command">UPDATE</code>s that
543 rarely or never logically modify the columns that the index covers.
544 The average and worst-case number of versions per logical row can
545 be kept low purely through targeted incremental deletion passes.
546 It's quite possible that the on-disk size of certain indexes will
547 never increase by even one single page/block despite
548 <span class="emphasis"><em>constant</em></span> version churn from
549 <code class="command">UPDATE</code>s. Even then, an exhaustive <span class="quote">“<span class="quote">clean
550 sweep</span>”</span> by a <code class="command">VACUUM</code> operation (typically
551 run in an autovacuum worker process) will eventually be required as
552 a part of <span class="emphasis"><em>collective</em></span> cleanup of the table and
555 Unlike <code class="command">VACUUM</code>, bottom-up index deletion does not
556 provide any strong guarantees about how old the oldest garbage
557 index tuple may be. No index can be permitted to retain
558 <span class="quote">“<span class="quote">floating garbage</span>”</span> index tuples that became dead prior
559 to a conservative cutoff point shared by the table and all of its
560 indexes collectively. This fundamental table-level invariant makes
561 it safe to recycle table <acronym class="acronym">TID</acronym>s. This is how it
562 is possible for distinct logical rows to reuse the same table
563 <acronym class="acronym">TID</acronym> over time (though this can never happen with
564 two logical rows whose lifetimes span the same
565 <code class="command">VACUUM</code> cycle).
566 </p></div><div class="sect3" id="BTREE-DEDUPLICATION"><div class="titlepage"><div><div><h4 class="title">65.1.4.3. Deduplication <a href="#BTREE-DEDUPLICATION" class="id_link">#</a></h4></div></div></div><p>
567 A duplicate is a leaf page tuple (a tuple that points to a table
568 row) where <span class="emphasis"><em>all</em></span> indexed key columns have values
569 that match corresponding column values from at least one other leaf
570 page tuple in the same index. Duplicate tuples are quite common in
571 practice. B-Tree indexes can use a special, space-efficient
572 representation for duplicates when an optional technique is
573 enabled: <em class="firstterm">deduplication</em>.
575 Deduplication works by periodically merging groups of duplicate
576 tuples together, forming a single <em class="firstterm">posting list</em> tuple for each
577 group. The column key value(s) only appear once in this
578 representation. This is followed by a sorted array of
579 <acronym class="acronym">TID</acronym>s that point to rows in the table. This
580 significantly reduces the storage size of indexes where each value
581 (or each distinct combination of column values) appears several
582 times on average. The latency of queries can be reduced
583 significantly. Overall query throughput may increase
584 significantly. The overhead of routine index vacuuming may also be
585 reduced significantly.
586 </p><div class="note"><h3 class="title">Note</h3><p>
587 B-Tree deduplication is just as effective with
588 <span class="quote">“<span class="quote">duplicates</span>”</span> that contain a NULL value, even though
589 NULL values are never equal to each other according to the
590 <code class="literal">=</code> member of any B-Tree operator class. As far
591 as any part of the implementation that understands the on-disk
592 B-Tree structure is concerned, NULL is just another value from the
593 domain of indexed values.
595 The deduplication process occurs lazily, when a new item is
596 inserted that cannot fit on an existing leaf page, though only when
597 index tuple deletion could not free sufficient space for the new
598 item (typically deletion is briefly considered and then skipped
599 over). Unlike GIN posting list tuples, B-Tree posting list tuples
600 do not need to expand every time a new duplicate is inserted; they
601 are merely an alternative physical representation of the original
602 logical contents of the leaf page. This design prioritizes
603 consistent performance with mixed read-write workloads. Most
604 client applications will at least see a moderate performance
605 benefit from using deduplication. Deduplication is enabled by
608 <code class="command">CREATE INDEX</code> and <code class="command">REINDEX</code>
609 apply deduplication to create posting list tuples, though the
610 strategy they use is slightly different. Each group of duplicate
611 ordinary tuples encountered in the sorted input taken from the
612 table is merged into a posting list tuple
613 <span class="emphasis"><em>before</em></span> being added to the current pending leaf
614 page. Individual posting list tuples are packed with as many
615 <acronym class="acronym">TID</acronym>s as possible. Leaf pages are written out in
616 the usual way, without any separate deduplication pass. This
617 strategy is well-suited to <code class="command">CREATE INDEX</code> and
618 <code class="command">REINDEX</code> because they are once-off batch
621 Write-heavy workloads that don't benefit from deduplication due to
622 having few or no duplicate values in indexes will incur a small,
623 fixed performance penalty (unless deduplication is explicitly
624 disabled). The <code class="literal">deduplicate_items</code> storage
625 parameter can be used to disable deduplication within individual
626 indexes. There is never any performance penalty with read-only
627 workloads, since reading posting list tuples is at least as
628 efficient as reading the standard tuple representation. Disabling
629 deduplication isn't usually helpful.
631 It is sometimes possible for unique indexes (as well as unique
632 constraints) to use deduplication. This allows leaf pages to
633 temporarily <span class="quote">“<span class="quote">absorb</span>”</span> extra version churn duplicates.
634 Deduplication in unique indexes augments bottom-up index deletion,
635 especially in cases where a long-running transaction holds a
636 snapshot that blocks garbage collection. The goal is to buy time
637 for the bottom-up index deletion strategy to become effective
638 again. Delaying page splits until a single long-running
639 transaction naturally goes away can allow a bottom-up deletion pass
640 to succeed where an earlier deletion pass failed.
641 </p><div class="tip"><h3 class="title">Tip</h3><p>
642 A special heuristic is applied to determine whether a
643 deduplication pass in a unique index should take place. It can
644 often skip straight to splitting a leaf page, avoiding a
645 performance penalty from wasting cycles on unhelpful deduplication
646 passes. If you're concerned about the overhead of deduplication,
647 consider setting <code class="literal">deduplicate_items = off</code>
648 selectively. Leaving deduplication enabled in unique indexes has
651 Deduplication cannot be used in all cases due to
652 implementation-level restrictions. Deduplication safety is
653 determined when <code class="command">CREATE INDEX</code> or
654 <code class="command">REINDEX</code> is run.
656 Note that deduplication is deemed unsafe and cannot be used in the
657 following cases involving semantically significant differences
660 </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
661 <code class="type">text</code>, <code class="type">varchar</code>, and <code class="type">char</code>
662 cannot use deduplication when a
663 <span class="emphasis"><em>nondeterministic</em></span> collation is used. Case
664 and accent differences must be preserved among equal datums.
665 </p></li><li class="listitem"><p>
666 <code class="type">numeric</code> cannot use deduplication. Numeric display
667 scale must be preserved among equal datums.
668 </p></li><li class="listitem"><p>
669 <code class="type">jsonb</code> cannot use deduplication, since the
670 <code class="type">jsonb</code> B-Tree operator class uses
671 <code class="type">numeric</code> internally.
672 </p></li><li class="listitem"><p>
673 <code class="type">float4</code> and <code class="type">float8</code> cannot use
674 deduplication. These types have distinct representations for
675 <code class="literal">-0</code> and <code class="literal">0</code>, which are
676 nevertheless considered equal. This difference must be
678 </p></li></ul></div><p>
680 There is one further implementation-level restriction that may be
681 lifted in a future version of
682 <span class="productname">PostgreSQL</span>:
684 </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
685 Container types (such as composite types, arrays, or range
686 types) cannot use deduplication.
687 </p></li></ul></div><p>
689 There is one further implementation-level restriction that applies
690 regardless of the operator class or collation used:
692 </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
693 <code class="literal">INCLUDE</code> indexes can never use deduplication.
694 </p></li></ul></div><p>
695 </p></div></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="indextypes.html" title="Chapter 65. Built-in Index Access Methods">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="indextypes.html" title="Chapter 65. Built-in Index Access Methods">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="gist.html" title="65.2. GiST Indexes">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 65. Built-in Index Access Methods </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 65.2. GiST Indexes</td></tr></table></div></body></html>