65.1.2. Behavior of B-Tree Operator Classes
65.1.3. B-Tree Support Functions
PostgreSQL includes an implementation of the standard btree (multi-way balanced tree) index data structure. Any data type that can be sorted into a well-defined linear order can be indexed by a btree index. The only limitation is that an index entry cannot exceed approximately one-third of a page (after TOAST compression, if applicable).
Because each btree operator class imposes a sort order on its data type, btree operator classes (or, really, operator families) have come to be used as PostgreSQL's general representation and understanding of sorting semantics. Therefore, they've acquired some features that go beyond what would be needed just to support btree indexes, and parts of the system that are quite distant from the btree AM make use of them.
65.1.2. Behavior of B-Tree Operator Classes
As shown in Table 36.3, a btree operator class must provide five comparison operators, <, <=, =, >= and >. One might expect that <> should also be part of the operator class, but it is not, because it would almost never be useful to use a <> WHERE clause in an index search. (For some purposes, the planner treats <> as associated with a btree operator class; but it finds that operator via the = operator's negator link, rather than from pg_amop.)
When several data types share near-identical sorting semantics, their operator classes can be grouped into an operator family. Doing so is advantageous because it allows the planner to make deductions about cross-type comparisons. Each operator class within the family should contain the single-type operators (and associated support functions) for its input data type, while cross-type comparison operators and support functions are “loose” in the family. It is recommended that a complete set of cross-type operators be included in the family, thus ensuring that the planner can represent any comparison conditions that it deduces from transitivity.
There are some basic assumptions that a btree operator family must satisfy:

* An = operator must be an equivalence relation; that is, for all non-null values A, B, C of the data type:
  + A = A is true (reflexive law)
  + if A = B, then B = A (symmetric law)
  + if A = B and B = C, then A = C (transitive law)
* A < operator must be a strong ordering relation; that is, for all non-null values A, B, C:
  + A < A is false (irreflexive law)
  + if A < B and B < C, then A < C (transitive law)
* Furthermore, the ordering is total; that is, for all non-null values A and B:
  + exactly one of A < B, A = B, and B < A is true (trichotomy law)

(The trichotomy law justifies the definition of the comparison support function, of course.)
The other three operators are defined in terms of = and < in the obvious way, and must act consistently with them.
For an operator family supporting multiple data types, the above laws must hold when A, B, C are taken from any data types in the family. The transitive laws are the trickiest to ensure, as in cross-type situations they represent statements that the behaviors of two or three different operators are consistent. As an example, it would not work to put float8 and numeric into the same operator family, at least not with the current semantics that numeric values are converted to float8 for comparison to a float8. Because of the limited accuracy of float8, this means there are distinct numeric values that will compare equal to the same float8 value, and thus the transitive law would fail.
Another requirement for a multiple-data-type family is that any implicit or binary-coercion casts that are defined between data types included in the operator family must not change the associated sort ordering.
It should be fairly clear why a btree index requires these laws to hold within a single data type: without them there is no ordering to arrange the keys with. Also, index searches using a comparison key of a different data type require comparisons to behave sanely across two data types. The extensions to three or more data types within a family are not strictly required by the btree index mechanism itself, but the planner relies on them for optimization purposes.
65.1.3. B-Tree Support Functions
As shown in Table 36.9, btree defines one required and five optional support functions. The six user-defined methods are:
For each combination of data types that a btree operator family provides comparison operators for, it must provide a comparison support function, registered in pg_amproc with support function number 1 and amproclefttype/amprocrighttype equal to the left and right data types for the comparison (i.e., the same data types that the matching operators are registered with in pg_amop). The comparison function must take two non-null values A and B and return an int32 value that is < 0, 0, or > 0 when A < B, A = B, or A > B, respectively. A null result is disallowed: all values of the data type must be comparable. See src/backend/access/nbtree/nbtcompare.c for examples.
If the compared values are of a collatable data type, the appropriate collation OID will be passed to the comparison support function, using the standard PG_GET_COLLATION() mechanism.
Optionally, a btree operator family may provide sort support function(s), registered under support function number 2. These functions allow implementing comparisons for sorting purposes in a more efficient way than naively calling the comparison support function. The APIs involved in this are defined in src/include/utils/sortsupport.h.
Optionally, a btree operator family may provide in_range support function(s), registered under support function number 3. These are not used during btree index operations; rather, they extend the semantics of the operator family so that it can support window clauses containing the RANGE offset PRECEDING and RANGE offset FOLLOWING frame bound types (see Section 4.2.8). Fundamentally, the extra information provided is how to add or subtract an offset value in a way that is compatible with the family's data ordering.
An in_range function must have the signature

in_range(val type1, base type1, offset type2, sub bool, less bool)
returns bool
val and base must be of the same type, which is one of the types supported by the operator family (i.e., a type for which it provides an ordering). However, offset could be of a different type, which might be one otherwise unsupported by the family. An example is that the built-in time_ops family provides an in_range function that has offset of type interval. A family can provide in_range functions for any of its supported types and one or more offset types. Each in_range function should be entered in pg_amproc with amproclefttype equal to type1 and amprocrighttype equal to type2.
The essential semantics of an in_range function depend on the two Boolean flag parameters. It should add or subtract base and offset, then compare val to the result, as follows:

+ if !sub and !less, return val >= (base + offset)
+ if !sub and less, return val <= (base + offset)
+ if sub and !less, return val >= (base - offset)
+ if sub and less, return val <= (base - offset)
Before doing so, the function should check the sign of offset: if it is less than zero, raise error ERRCODE_INVALID_PRECEDING_OR_FOLLOWING_SIZE (22013) with error text like “invalid preceding or following size in window function”. (This is required by the SQL standard, although nonstandard operator families might perhaps choose to ignore this restriction, since there seems to be little semantic necessity for it.) This requirement is delegated to the in_range function so that the core code needn't understand what “less than zero” means for a particular data type.
An additional expectation is that in_range functions should, if practical, avoid throwing an error if base + offset or base - offset would overflow. The correct comparison result can be determined even if that value would be out of the data type's range. Note that if the data type includes concepts such as “infinity” or “NaN”, extra care may be needed to ensure that in_range's results agree with the normal sort order of the operator family.
The results of the in_range function must be consistent with the sort ordering imposed by the operator family. To be precise, given any fixed values of offset and sub, then:

+ If in_range with less = true is true for some val1 and base, it must be true for every val2 <= val1 with the same base.
+ If in_range with less = true is false for some val1 and base, it must be false for every val2 >= val1 with the same base.
+ If in_range with less = true is true for some val and base1, it must be true for every base2 >= base1 with the same val.
+ If in_range with less = true is false for some val and base1, it must be false for every base2 <= base1 with the same val.
Analogous statements with inverted conditions hold when less = false.
If the type being ordered (type1) is collatable, the appropriate collation OID will be passed to the in_range function, using the standard PG_GET_COLLATION() mechanism.

in_range functions need not handle NULL inputs, and typically will be marked strict.
Optionally, a btree operator family may provide equalimage (“equality implies image equality”) support functions, registered under support function number 4. These functions allow the core code to determine when it is safe to apply the btree deduplication optimization. Currently, equalimage functions are only called when building or rebuilding an index.
An equalimage function must have the signature

equalimage(opcintype oid) returns bool
The return value is static information about an operator class and collation. Returning true indicates that the order function for the operator class is guaranteed to only return 0 (“arguments are equal”) when its A and B arguments are also interchangeable without any loss of semantic information. Not registering an equalimage function or returning false indicates that this condition cannot be assumed to hold.
The opcintype argument is the pg_type.oid of the data type that the operator class indexes. This is a convenience that allows reuse of the same underlying equalimage function across operator classes. If opcintype is a collatable data type, the appropriate collation OID will be passed to the equalimage function, using the standard PG_GET_COLLATION() mechanism.
As far as the operator class is concerned, returning true indicates that deduplication is safe (or safe for the collation whose OID was passed to its equalimage function). However, the core code will only deem deduplication safe for an index when every indexed column uses an operator class that registers an equalimage function, and each function actually returns true when called.
Image equality is almost the same condition as simple bitwise equality. There is one subtle difference: when indexing a varlena data type, the on-disk representation of two image equal datums may not be bitwise equal due to inconsistent application of TOAST compression on input. Formally, when an operator class's equalimage function returns true, it is safe to assume that the datum_image_eq() C function will always agree with the operator class's order function (provided that the same collation OID is passed to both the equalimage and order functions).
247 the “equality implies image equality” status of an operator
248 class within a multiple-data-type family based on details from
249 other operator classes in the same family. Also, it is not
250 sensible for an operator family to register a cross-type
251 equalimage function, and attempting to do so will result in an
252 error. This is because “equality implies image equality” status
253 does not just depend on sorting/equality semantics, which are
254 more or less defined at the operator family level. In general,
255 the semantics that one particular data type implements must be
256 considered separately.
The convention followed by the operator classes included with the core PostgreSQL distribution is to register a stock, generic equalimage function. Most operator classes register btequalimage(), which indicates that deduplication is safe unconditionally. Operator classes for collatable data types such as text register btvarstrequalimage(), which indicates that deduplication is safe with deterministic collations. Best practice for third-party extensions is to register their own custom equalimage function, to retain control.
Optionally, a B-tree operator family may provide options (“operator class specific options”) support functions, registered under support function number 5. These functions define a set of user-visible parameters that control operator class behavior.
An options support function must have the signature

options(relopts local_relopts *) returns void
The function is passed a pointer to a local_relopts struct, which needs to be filled with a set of operator class specific options. The options can be accessed from other support functions using the PG_HAS_OPCLASS_OPTIONS() and PG_GET_OPCLASS_OPTIONS() macros.
Currently, no B-tree operator class has an options support function. B-tree doesn't allow flexible representation of keys like GiST, SP-GiST, GIN and BRIN do, so options probably doesn't have much application in the current B-tree index access method. Nevertheless, this support function was added to B-tree for uniformity, and will probably find uses during further evolution of B-tree in PostgreSQL.
Optionally, a btree operator family may provide a skip support function, registered under support function number 6. These functions give the B-tree code a way to iterate through every possible value that can be represented by an operator class's underlying input type, in key space order. This is used by the core code when it applies the skip scan optimization. The APIs involved in this are defined in src/include/utils/skipsupport.h.
Operator classes that do not provide a skip support function are still eligible to use skip scan. The core code can still use its fallback strategy, though that might be suboptimal for some discrete types. It usually doesn't make sense (and may not even be feasible) for operator classes on continuous types to provide a skip support function.
It is not sensible for an operator family to register a cross-type skip support function, and attempting to do so will result in an error. This is because determining the next indexable value must happen by incrementing a value copied from an index tuple. The values generated must all be of the same underlying data type (the “skipped” index column's opclass input type).
65.1.4. Implementation
This section covers B-Tree index implementation details that may be of use to advanced users. See src/backend/access/nbtree/README in the source distribution for a much more detailed, internals-focused description of the B-Tree implementation.
65.1.4.1. B-Tree Structure
PostgreSQL B-Tree indexes are multi-level tree structures, where each level of the tree can be used as a doubly-linked list of pages. A single metapage is stored in a fixed position at the start of the first segment file of the index. All other pages are either leaf pages or internal pages. Leaf pages are the pages on the lowest level of the tree. All other levels consist of internal pages. Each leaf page contains tuples that point to table rows. Each internal page contains tuples that point to the next level down in the tree. Typically, over 99% of all pages are leaf pages. Both internal pages and leaf pages use the standard page format described in Section 66.6.
New leaf pages are added to a B-Tree index when an existing leaf page cannot fit an incoming tuple. A page split operation makes room for items that originally belonged on the overflowing page by moving a portion of the items to a new page. Page splits must also insert a new downlink to the new page in the parent page, which may cause the parent to split in turn. Page splits “cascade upwards” in a recursive fashion. When the root page finally cannot fit a new downlink, a root page split operation takes place. This adds a new level to the tree structure by creating a new root page that is one level above the original root page.
65.1.4.2. Bottom-up Index Deletion
B-Tree indexes are not directly aware that under MVCC, there might be multiple extant versions of the same logical table row; to an index, each tuple is an independent object that needs its own index entry. “Version churn” tuples may sometimes accumulate and adversely affect query latency and throughput. This typically occurs with UPDATE-heavy workloads where most individual updates cannot apply the HOT optimization. Changing the value of only one column covered by one index during an UPDATE always necessitates a new set of index tuples, one for each and every index on the table. Note in particular that this includes indexes that were not “logically modified” by the UPDATE. All indexes will need a successor physical index tuple that points to the latest version in the table. Each new tuple within each index will generally need to coexist with the original “updated” tuple for a short period of time (typically until shortly after the UPDATE transaction commits).
B-Tree indexes incrementally delete version churn index tuples by performing bottom-up index deletion passes. Each deletion pass is triggered in reaction to an anticipated “version churn page split”. This only happens with indexes that are not logically modified by UPDATE statements, where concentrated build-up of obsolete versions in particular pages would occur otherwise. A page split will usually be avoided, though it's possible that certain implementation-level heuristics will fail to identify and delete even one garbage index tuple (in which case a page split or deduplication pass resolves the issue of an incoming new tuple not fitting on a leaf page). The worst-case number of versions that any index scan must traverse (for any single logical row) is an important contributor to overall system responsiveness and throughput. A bottom-up index deletion pass targets suspected garbage tuples in a single leaf page based on qualitative distinctions involving logical rows and versions. This contrasts with the “top-down” index cleanup performed by autovacuum workers, which is triggered when certain quantitative table-level thresholds are exceeded (see Section 24.1.6).
Not all deletion operations that are performed within B-Tree indexes are bottom-up deletion operations. There is a distinct category of index tuple deletion: simple index tuple deletion. This is a deferred maintenance operation that deletes index tuples that are known to be safe to delete (those whose item identifier's LP_DEAD bit is already set). Like bottom-up index deletion, simple index deletion takes place at the point that a page split is anticipated, as a way of avoiding the split.
Simple deletion is opportunistic in the sense that it can only take place when recent index scans set the LP_DEAD bits of affected items in passing. Prior to PostgreSQL 14, the only category of B-Tree deletion was simple deletion. The main differences between it and bottom-up deletion are that only the former is opportunistically driven by the activity of passing index scans, while only the latter specifically targets version churn from UPDATEs that do not logically modify indexed columns.
Bottom-up index deletion performs the vast majority of all garbage index tuple cleanup for particular indexes with certain workloads. This is expected with any B-Tree index that is subject to significant version churn from UPDATEs that rarely or never logically modify the columns that the index covers. The average and worst-case number of versions per logical row can be kept low purely through targeted incremental deletion passes. It's quite possible that the on-disk size of certain indexes will never increase by even a single page/block despite constant version churn from UPDATEs. Even then, an exhaustive “clean sweep” by a VACUUM operation (typically run in an autovacuum worker process) will eventually be required as a part of collective cleanup of the table and each of its indexes.
Unlike VACUUM, bottom-up index deletion does not provide any strong guarantees about how old the oldest garbage index tuple may be. No index can be permitted to retain “floating garbage” index tuples that became dead prior to a conservative cutoff point shared by the table and all of its indexes collectively. This fundamental table-level invariant makes it safe to recycle table TIDs. This is how it is possible for distinct logical rows to reuse the same table TID over time (though this can never happen with two logical rows whose lifetimes span the same VACUUM cycle).
65.1.4.3. Deduplication
A duplicate is a leaf page tuple (a tuple that points to a table row) where all indexed key columns have values that match corresponding column values from at least one other leaf page tuple in the same index. Duplicate tuples are quite common in practice. B-Tree indexes can use a special, space-efficient representation for duplicates when an optional technique is enabled: deduplication.
Deduplication works by periodically merging groups of duplicate tuples together, forming a single posting list tuple for each group. The column key value(s) only appear once in this representation. This is followed by a sorted array of TIDs that point to rows in the table. This significantly reduces the storage size of indexes where each value (or each distinct combination of column values) appears several times on average. Query latency can be reduced, overall query throughput may increase, and the overhead of routine index vacuuming may also be reduced.
B-Tree deduplication is just as effective with “duplicates” that contain a NULL value, even though NULL values are never equal to each other according to the = member of any B-Tree operator class. As far as any part of the implementation that understands the on-disk B-Tree structure is concerned, NULL is just another value from the domain of indexed values.
The deduplication process occurs lazily, when a new item is inserted that cannot fit on an existing leaf page, though only when index tuple deletion could not free sufficient space for the new item (typically deletion is briefly considered and then skipped over). Unlike GIN posting list tuples, B-Tree posting list tuples do not need to expand every time a new duplicate is inserted; they are merely an alternative physical representation of the original logical contents of the leaf page. This design prioritizes consistent performance with mixed read-write workloads. Most client applications will at least see a moderate performance benefit from using deduplication. Deduplication is enabled by default.
CREATE INDEX and REINDEX apply deduplication to create posting list tuples, though the strategy they use is slightly different. Each group of duplicate ordinary tuples encountered in the sorted input taken from the table is merged into a posting list tuple before being added to the current pending leaf page. Individual posting list tuples are packed with as many TIDs as possible. Leaf pages are written out in the usual way, without any separate deduplication pass. This strategy is well-suited to CREATE INDEX and REINDEX because they are once-off batch operations.
Write-heavy workloads that don't benefit from deduplication due to having few or no duplicate values in indexes will incur a small, fixed performance penalty (unless deduplication is explicitly disabled). The deduplicate_items storage parameter can be used to disable deduplication within individual indexes. There is never any performance penalty with read-only workloads, since reading posting list tuples is at least as efficient as reading the standard tuple representation. Disabling deduplication isn't usually helpful.
It is sometimes possible for unique indexes (as well as unique constraints) to use deduplication. This allows leaf pages to temporarily “absorb” extra version churn duplicates. Deduplication in unique indexes augments bottom-up index deletion, especially in cases where a long-running transaction holds a snapshot that blocks garbage collection. The goal is to buy time for the bottom-up index deletion strategy to become effective again. Delaying page splits until a single long-running transaction naturally goes away can allow a bottom-up deletion pass to succeed where an earlier deletion pass failed.
A special heuristic is applied to determine whether a deduplication pass in a unique index should take place. It can often skip straight to splitting a leaf page, avoiding a performance penalty from wasting cycles on unhelpful deduplication passes. If you're concerned about the overhead of deduplication, consider setting deduplicate_items = off selectively. Leaving deduplication enabled in unique indexes has little downside.
Deduplication cannot be used in all cases due to implementation-level restrictions. Deduplication safety is determined when CREATE INDEX or REINDEX is run.
Note that deduplication is deemed unsafe and cannot be used in the following cases involving semantically significant differences among equal datums:
* text, varchar, and char cannot use deduplication when a nondeterministic collation is used. Case and accent differences must be preserved among equal datums.
* numeric cannot use deduplication. Numeric display scale must be preserved among equal datums.
* jsonb cannot use deduplication, since the jsonb B-Tree operator class uses numeric internally.
* float4 and float8 cannot use deduplication. These types have distinct representations for -0 and 0, which are nevertheless considered equal. This difference must be preserved.
There is one further implementation-level restriction that may be lifted in a future version of PostgreSQL:

* Container types (such as composite types, arrays, or range types) cannot use deduplication.

There is one further implementation-level restriction that applies regardless of the operator class or collation used:

* INCLUDE indexes can never use deduplication.