65.1.2. Behavior of B-Tree Operator Classes
65.1.3. B-Tree Support Functions
PostgreSQL includes an implementation of the standard btree (multi-way balanced tree) index data structure. Any data type that can be sorted into a well-defined linear order can be indexed by a btree index. The only limitation is that an index entry cannot exceed approximately one-third of a page (after TOAST compression, if applicable).
Because each btree operator class imposes a sort order on its data type, btree operator classes (or, really, operator families) have come to be used as PostgreSQL's general representation and understanding of sorting semantics. Therefore, they've acquired some features that go beyond what would be needed just to support btree indexes, and parts of the system that are quite distant from the btree AM make use of them.
65.1.2. Behavior of B-Tree Operator Classes
As shown in Table 36.3, a btree operator class must provide five comparison operators, <, <=, =, >= and >. One might expect that <> should also be part of the operator class, but it is not, because it would almost never be useful to use a <> WHERE clause in an index search. (For some purposes, the planner treats <> as associated with a btree operator class; but it finds that operator via the = operator's negator link, rather than from pg_amop.)
When several data types share near-identical sorting semantics, their operator classes can be grouped into an operator family. Doing so is advantageous because it allows the planner to make deductions about cross-type comparisons. Each operator class within the family should contain the single-type operators (and associated support functions) for its input data type, while cross-type comparison operators and support functions are “loose” in the family. It is recommended that a complete set of cross-type operators be included in the family, thus ensuring that the planner can represent any comparison conditions that it deduces from transitivity.
There are some basic assumptions that a btree operator family must satisfy:

* An = operator must be an equivalence relation; that is, for all non-null values A, B, C of the data type:
  + A = A is true (reflexive law)
  + if A = B, then B = A (symmetric law)
  + if A = B and B = C, then A = C (transitive law)
* A < operator must be a strong ordering relation; that is, for all non-null values A, B, C:
  + A < A is false (irreflexive law)
  + if A < B and B < C, then A < C (transitive law)
* Furthermore, the ordering is total; that is, for all non-null values A and B:
  + exactly one of A < B, A = B, and B < A is true (trichotomy law)

(The trichotomy law justifies the definition of the comparison support function, of course.)
The other three operators are defined in terms of = and < in the obvious way, and must act consistently with them.
For an operator family supporting multiple data types, the above laws must hold when A, B, C are taken from any data types in the family. The transitive laws are the trickiest to ensure, as in cross-type situations they represent statements that the behaviors of two or three different operators are consistent. As an example, it would not work to put float8 and numeric into the same operator family, at least not with the current semantics that numeric values are converted to float8 for comparison to a float8. Because of the limited accuracy of float8, this means there are distinct numeric values that will compare equal to the same float8 value, and thus the transitive law would fail.
Another requirement for a multiple-data-type family is that any implicit or binary-coercion casts that are defined between data types included in the operator family must not change the associated sort ordering.
It should be fairly clear why a btree index requires these laws to hold within a single data type: without them there is no ordering to arrange the keys with. Also, index searches using a comparison key of a different data type require comparisons to behave sanely across two data types. The extensions to three or more data types within a family are not strictly required by the btree index mechanism itself, but the planner relies on them for optimization purposes.
65.1.3. B-Tree Support Functions
As shown in Table 36.9, btree defines one required and five optional support functions. The six user-defined methods are:
For each combination of data types that a btree operator family provides comparison operators for, it must provide a comparison support function, registered in pg_amproc with support function number 1 and amproclefttype/amprocrighttype equal to the left and right data types for the comparison (i.e., the same data types that the matching operators are registered with in pg_amop). The comparison function must take two non-null values A and B and return an int32 value that is < 0, 0, or > 0 when A < B, A = B, or A > B, respectively. A null result is disallowed: all values of the data type must be comparable. See src/backend/access/nbtree/nbtcompare.c for examples.
If the compared values are of a collatable data type, the appropriate collation OID will be passed to the comparison support function, using the standard PG_GET_COLLATION() mechanism.
Optionally, a btree operator family may provide sort support function(s), registered under support function number 2. These functions allow implementing comparisons for sorting purposes in a more efficient way than naively calling the comparison support function. The APIs involved in this are defined in src/include/utils/sortsupport.h.
Optionally, a btree operator family may provide in_range support function(s), registered under support function number 3. These are not used during btree index operations; rather, they extend the semantics of the operator family so that it can support window clauses containing the RANGE offset PRECEDING and RANGE offset FOLLOWING frame bound types (see Section 4.2.8). Fundamentally, the extra information provided is how to add or subtract an offset value in a way that is compatible with the family's data ordering.
An in_range function must have the signature

in_range(val type1, base type1, offset type2, sub bool, less bool)
returns bool
val and base must be of the same type, which is one of the types supported by the operator family (i.e., a type for which it provides an ordering). However, offset could be of a different type, which might be one otherwise unsupported by the family. An example is that the built-in time_ops family provides an in_range function that has offset of type interval. A family can provide in_range functions for any of its supported types and one or more offset types. Each in_range function should be entered in pg_amproc with amproclefttype equal to type1 and amprocrighttype equal to type2.
The essential semantics of an in_range function depend on the two Boolean flag parameters. It should add or subtract base and offset, then compare val to the result, as follows:

+ if !sub and !less, return val >= (base + offset)
+ if !sub and less, return val <= (base + offset)
+ if sub and !less, return val >= (base - offset)
+ if sub and less, return val <= (base - offset)
Before doing so, the function should check the sign of offset: if it is less than zero, raise error ERRCODE_INVALID_PRECEDING_OR_FOLLOWING_SIZE (22013) with error text like “invalid preceding or following size in window function”. (This is required by the SQL standard, although nonstandard operator families might perhaps choose to ignore this restriction, since there seems to be little semantic necessity for it.) This requirement is delegated to the in_range function so that the core code needn't understand what “less than zero” means for a particular data type.
An additional expectation is that in_range functions should, if practical, avoid throwing an error if base + offset or base - offset would overflow. The correct comparison result can be determined even if that value would be out of the data type's range. Note that if the data type includes concepts such as “infinity” or “NaN”, extra care may be needed to ensure that in_range's results agree with the normal sort order of the operator family.
The results of the in_range function must be consistent with the sort ordering imposed by the operator family. To be precise, given any fixed values of offset and sub, then:

+ If in_range with less = true is true for some val1 and base, it must be true for every val2 <= val1 with the same base.
+ If in_range with less = true is false for some val1 and base, it must be false for every val2 >= val1 with the same base.
+ If in_range with less = true is true for some val and base1, it must be true for every base2 >= base1 with the same val.
+ If in_range with less = true is false for some val and base1, it must be false for every base2 <= base1 with the same val.
Analogous statements with inverted conditions hold when less = false.
If the type being ordered (type1) is collatable, the appropriate collation OID will be passed to the in_range function, using the standard PG_GET_COLLATION() mechanism.

in_range functions need not handle NULL inputs, and typically will be marked strict.
Optionally, a btree operator family may provide equalimage (“equality implies image equality”) support functions, registered under support function number 4. These functions allow the core code to determine when it is safe to apply the btree deduplication optimization. Currently, equalimage functions are only called when building or rebuilding an index.
An equalimage function must have the signature

equalimage(opcintype oid) returns bool
The return value is static information about an operator class and collation. Returning true indicates that the order function for the operator class is guaranteed to only return 0 (“arguments are equal”) when its A and B arguments are also interchangeable without any loss of semantic information. Not registering an equalimage function or returning false indicates that this condition cannot be assumed to hold.
The opcintype argument is the pg_type.oid of the data type that the operator class indexes. This is a convenience that allows reuse of the same underlying equalimage function across operator classes. If opcintype is a collatable data type, the appropriate collation OID will be passed to the equalimage function, using the standard PG_GET_COLLATION() mechanism.
As far as the operator class is concerned, returning true indicates that deduplication is safe (or safe for the collation whose OID was passed to its equalimage function). However, the core code will only deem deduplication safe for an index when every indexed column uses an operator class that registers an equalimage function, and each function actually returns true when called.
Image equality is almost the same condition as simple bitwise equality. There is one subtle difference: when indexing a varlena data type, the on-disk representation of two image equal datums may not be bitwise equal due to inconsistent application of TOAST compression on input. Formally, when an operator class's equalimage function returns true, it is safe to assume that the datum_image_eq() C function will always agree with the operator class's order function (provided that the same collation OID is passed to both the equalimage and order functions).
247 the “equality implies image equality” status of an operator
248 class within a multiple-data-type family based on details from
249 other operator classes in the same family. Also, it is not
250 sensible for an operator family to register a cross-type
251 equalimage function, and attempting to do so will result in an
252 error. This is because “equality implies image equality” status
253 does not just depend on sorting/equality semantics, which are
254 more or less defined at the operator family level. In general,
255 the semantics that one particular data type implements must be
256 considered separately.
The convention followed by the operator classes included with the core PostgreSQL distribution is to register a stock, generic equalimage function. Most operator classes register btequalimage(), which indicates that deduplication is safe unconditionally. Operator classes for collatable data types such as text register btvarstrequalimage(), which indicates that deduplication is safe with deterministic collations. Best practice for third-party extensions is to register their own custom equalimage function, to retain control.
Optionally, a B-tree operator family may provide options (“operator class specific options”) support functions, registered under support function number 5. These functions define a set of user-visible parameters that control operator class behavior.
An options support function must have the signature

options(relopts local_relopts *) returns void
The function is passed a pointer to a local_relopts struct, which needs to be filled with a set of operator class specific options. The options can be accessed from other support functions using the PG_HAS_OPCLASS_OPTIONS() and PG_GET_OPCLASS_OPTIONS() macros.
Currently, no B-tree operator class has an options support function. B-tree doesn't allow flexible representation of keys like GiST, SP-GiST, GIN and BRIN do, so options probably doesn't have much application in the current B-tree index access method. Nevertheless, this support function was added to B-tree for uniformity, and will probably find uses during further evolution of B-tree in PostgreSQL.
Optionally, a btree operator family may provide a skip support function, registered under support function number 6. These functions give the B-tree code a way to iterate through every possible value that can be represented by an operator class's underlying input type, in key space order. This is used by the core code when it applies the skip scan optimization. The APIs involved in this are defined in src/include/utils/skipsupport.h.
Operator classes that do not provide a skip support function are still eligible to use skip scan. The core code can still use its fallback strategy, though that might be suboptimal for some discrete types. It usually doesn't make sense (and may not even be feasible) for operator classes on continuous types to provide a skip support function.
It is not sensible for an operator family to register a cross-type skip support function, and attempting to do so will result in an error. This is because determining the next indexable value must happen by incrementing a value copied from an index tuple. The values generated must all be of the same underlying data type (the “skipped” index column's opclass input type).
65.1.4. Implementation
This section covers B-Tree index implementation details that may be of use to advanced users. See src/backend/access/nbtree/README in the source distribution for a much more detailed, internals-focused description of the B-Tree implementation.
65.1.4.1. B-Tree Structure
PostgreSQL B-Tree indexes are multi-level tree structures, where each level of the tree can be used as a doubly-linked list of pages. A single metapage is stored in a fixed position at the start of the first segment file of the index. All other pages are either leaf pages or internal pages. Leaf pages are the pages on the lowest level of the tree. All other levels consist of internal pages. Each leaf page contains tuples that point to table rows. Each internal page contains tuples that point to the next level down in the tree. Typically, over 99% of all pages are leaf pages. Both internal pages and leaf pages use the standard page format described in Section 66.6.
New leaf pages are added to a B-Tree index when an existing leaf page cannot fit an incoming tuple. A page split operation makes room for items that originally belonged on the overflowing page by moving a portion of the items to a new page. Page splits must also insert a new downlink to the new page in the parent page, which may cause the parent to split in turn. Page splits “cascade upwards” in a recursive fashion. When the root page finally cannot fit a new downlink, a root page split operation takes place. This adds a new level to the tree structure by creating a new root page that is one level above the original root page.
65.1.4.2. Bottom-up Index Deletion
B-Tree indexes are not directly aware that under MVCC, there might be multiple extant versions of the same logical table row; to an index, each tuple is an independent object that needs its own index entry. “Version churn” tuples may sometimes accumulate and adversely affect query latency and throughput. This typically occurs with UPDATE-heavy workloads where most individual updates cannot apply the HOT optimization. Changing the value of only one column covered by one index during an UPDATE always necessitates a new set of index tuples, one for each and every index on the table. Note in particular that this includes indexes that were not “logically modified” by the UPDATE. All indexes will need a successor physical index tuple that points to the latest version in the table. Each new tuple within each index will generally need to coexist with the original “updated” tuple for a short period of time (typically until shortly after the UPDATE transaction commits).
B-Tree indexes incrementally delete version churn index tuples by performing bottom-up index deletion passes. Each deletion pass is triggered in reaction to an anticipated “version churn page split”. This only happens with indexes that are not logically modified by UPDATE statements, where concentrated build-up of obsolete versions in particular pages would occur otherwise. A page split will usually be avoided, though it's possible that certain implementation-level heuristics will fail to identify and delete even one garbage index tuple (in which case a page split or deduplication pass resolves the issue of an incoming new tuple not fitting on a leaf page). The worst-case number of versions that any index scan must traverse (for any single logical row) is an important contributor to overall system responsiveness and throughput. A bottom-up index deletion pass targets suspected garbage tuples in a single leaf page based on qualitative distinctions involving logical rows and versions. This contrasts with the “top-down” index cleanup performed by autovacuum workers, which is triggered when certain quantitative table-level thresholds are exceeded (see Section 24.1.6).
Not all deletion operations that are performed within B-Tree indexes are bottom-up deletion operations. There is a distinct category of index tuple deletion: simple index tuple deletion. This is a deferred maintenance operation that deletes index tuples that are known to be safe to delete (those whose item identifier's LP_DEAD bit is already set). Like bottom-up index deletion, simple index deletion takes place at the point that a page split is anticipated, as a way of avoiding the split.
Simple deletion is opportunistic in the sense that it can only take place when recent index scans set the LP_DEAD bits of affected items in passing. Prior to PostgreSQL 14, the only category of B-Tree deletion was simple deletion. The main differences between it and bottom-up deletion are that only the former is opportunistically driven by the activity of passing index scans, while only the latter specifically targets version churn from UPDATEs that do not logically modify indexed columns.
Bottom-up index deletion performs the vast majority of all garbage index tuple cleanup for particular indexes with certain workloads. This is expected with any B-Tree index that is subject to significant version churn from UPDATEs that rarely or never logically modify the columns that the index covers. The average and worst-case number of versions per logical row can be kept low purely through targeted incremental deletion passes. It's quite possible that the on-disk size of certain indexes will never increase by even a single page/block despite constant version churn from UPDATEs. Even then, an exhaustive “clean sweep” by a VACUUM operation (typically run in an autovacuum worker process) will eventually be required as a part of collective cleanup of the table and each of its indexes.
Unlike VACUUM, bottom-up index deletion does not provide any strong guarantees about how old the oldest garbage index tuple may be. No index can be permitted to retain “floating garbage” index tuples that became dead prior to a conservative cutoff point shared by the table and all of its indexes collectively. This fundamental table-level invariant makes it safe to recycle table TIDs. This is how it is possible for distinct logical rows to reuse the same table TID over time (though this can never happen with two logical rows whose lifetimes span the same VACUUM cycle).
65.1.4.3. Deduplication
A duplicate is a leaf page tuple (a tuple that points to a table row) where all indexed key columns have values that match corresponding column values from at least one other leaf page tuple in the same index. Duplicate tuples are quite common in practice. B-Tree indexes can use a special, space-efficient representation for duplicates when an optional technique is enabled: deduplication.
Deduplication works by periodically merging groups of duplicate tuples together, forming a single posting list tuple for each group. The column key value(s) only appear once in this representation. This is followed by a sorted array of TIDs that point to rows in the table. This significantly reduces the storage size of indexes where each value (or each distinct combination of column values) appears several times on average. Query latency can be reduced, overall query throughput may increase, and the overhead of routine index vacuuming may also be reduced.
B-Tree deduplication is just as effective with “duplicates” that contain a NULL value, even though NULL values are never equal to each other according to the = member of any B-Tree operator class. As far as any part of the implementation that understands the on-disk B-Tree structure is concerned, NULL is just another value from the domain of indexed values.
The deduplication process occurs lazily, when a new item is inserted that cannot fit on an existing leaf page, though only when index tuple deletion could not free sufficient space for the new item (typically deletion is briefly considered and then skipped over). Unlike GIN posting list tuples, B-Tree posting list tuples do not need to expand every time a new duplicate is inserted; they are merely an alternative physical representation of the original logical contents of the leaf page. This design prioritizes consistent performance with mixed read-write workloads. Most client applications will at least see a moderate performance benefit from using deduplication. Deduplication is enabled by default.
CREATE INDEX and REINDEX apply deduplication to create posting list tuples, though the strategy they use is slightly different. Each group of duplicate ordinary tuples encountered in the sorted input taken from the table is merged into a posting list tuple before being added to the current pending leaf page. Individual posting list tuples are packed with as many TIDs as possible. Leaf pages are written out in the usual way, without any separate deduplication pass. This strategy is well-suited to CREATE INDEX and REINDEX because they are once-off batch operations.
Write-heavy workloads that don't benefit from deduplication due to having few or no duplicate values in indexes will incur a small, fixed performance penalty (unless deduplication is explicitly disabled). The deduplicate_items storage parameter can be used to disable deduplication within individual indexes. There is never any performance penalty with read-only workloads, since reading posting list tuples is at least as efficient as reading the standard tuple representation. Disabling deduplication isn't usually helpful.
It is sometimes possible for unique indexes (as well as unique constraints) to use deduplication. This allows leaf pages to temporarily “absorb” extra version churn duplicates. Deduplication in unique indexes augments bottom-up index deletion, especially in cases where a long-running transaction holds a snapshot that blocks garbage collection. The goal is to buy time for the bottom-up index deletion strategy to become effective again. Delaying page splits until a single long-running transaction naturally goes away can allow a bottom-up deletion pass to succeed where an earlier deletion pass failed.
A special heuristic is applied to determine whether a deduplication pass in a unique index should take place. It can often skip straight to splitting a leaf page, avoiding a performance penalty from wasting cycles on unhelpful deduplication passes. If you're concerned about the overhead of deduplication, consider setting deduplicate_items = off selectively. Leaving deduplication enabled in unique indexes has little downside.
Deduplication cannot be used in all cases due to implementation-level restrictions. Deduplication safety is determined when CREATE INDEX or REINDEX is run.
Note that deduplication is deemed unsafe and cannot be used in the following cases involving semantically significant differences among equal datums:
* text, varchar, and char cannot use deduplication when a nondeterministic collation is used. Case and accent differences must be preserved among equal datums.
* numeric cannot use deduplication. Numeric display scale must be preserved among equal datums.
* jsonb cannot use deduplication, since the jsonb B-Tree operator class uses numeric internally.
* float4 and float8 cannot use deduplication. These types have distinct representations for -0 and 0, which are nevertheless considered equal. This difference must be preserved.
There is one further implementation-level restriction that may be lifted in a future version of PostgreSQL:

* Container types (such as composite types, arrays, or range types) cannot use deduplication.

There is one further implementation-level restriction that applies regardless of the operator class or collation used:

* INCLUDE indexes can never use deduplication.