66.2. TOAST

   66.2.1. Out-of-Line, On-Disk TOAST Storage
   66.2.2. Out-of-Line, In-Memory TOAST Storage

   This section provides an overview of TOAST (The Oversized-Attribute
   Storage Technique).

   PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow
   tuples to span multiple pages. Therefore, it is not possible to store
   very large field values directly. To overcome this limitation, large
   field values are compressed and/or broken up into multiple physical
   rows. This happens transparently to the user, with only small impact on
   most of the backend code. The technique is affectionately known as
   TOAST (or “the best thing since sliced bread”). The TOAST
   infrastructure is also used to improve handling of large data values
   in-memory.

   Only certain data types support TOAST; there is no need to impose the
   overhead on data types that cannot produce large field values. To
   support TOAST, a data type must have a variable-length (varlena)
   representation, in which, ordinarily, the first four-byte word of any
   stored value contains the total length of the value in bytes (including
   itself). TOAST does not constrain the rest of the data type's
   representation. The special representations collectively called TOASTed
   values work by modifying or reinterpreting this initial length word.
   Therefore, the C-level functions supporting a TOAST-able data type must
   be careful about how they handle potentially TOASTed input values: an
   input might not actually consist of a four-byte length word and
   contents until after it's been detoasted. (This is normally done by
   invoking PG_DETOAST_DATUM before doing anything with an input value,
   but in some cases more efficient approaches are possible. See
   Section 36.13.1 for more detail.)

   TOAST usurps two bits of the varlena length word (the high-order bits
   on big-endian machines, the low-order bits on little-endian machines),
   thereby limiting the logical size of any value of a TOAST-able data
   type to 1 GB (2^30 - 1 bytes). When both bits are zero, the value is an
   ordinary un-TOASTed value of the data type, and the remaining bits of
   the length word give the total datum size (including length word) in
   bytes. When the highest-order or lowest-order bit is set, the value has
   only a single-byte header instead of the normal four-byte header, and
   the remaining bits of that byte give the total datum size (including
   length byte) in bytes. This alternative supports space-efficient
   storage of values shorter than 127 bytes, while still allowing the data
   type to grow to 1 GB at need. Values with single-byte headers aren't
   aligned on any particular boundary, whereas values with four-byte
   headers are aligned on at least a four-byte boundary; this omission of
   alignment padding provides additional space savings that are
   significant relative to the size of short values. As a special case, if
   the remaining bits of a single-byte header are all zero (which would be
   impossible for a self-inclusive length), the value is a pointer to
   out-of-line data, with several possible alternatives as described
   below. The type and size of such a TOAST pointer are determined by a
   code stored in the second byte of the datum. Lastly, when the
   highest-order or lowest-order bit is clear but the adjacent bit is set,
   the content of the datum has been compressed and must be decompressed
   before use. In this case the remaining bits of the four-byte length
   word give the total size of the compressed datum, not the original
   data. Note that compression is also possible for out-of-line data, but
   the varlena header does not indicate whether it has occurred; the
   content of the TOAST pointer indicates that instead.
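
   The bit tests described in the preceding paragraph can be illustrated
   with a small standalone C sketch. It assumes the little-endian layout,
   in which the two flag bits are the low-order bits of the first byte.
   HeaderKind, classify_header, and header_size_field are hypothetical
   names chosen for this example; they are not PostgreSQL identifiers,
   and real code should rely on the macros PostgreSQL provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for this sketch; not PostgreSQL identifiers. */
typedef enum {
    HDR_4B_PLAIN,       /* both flag bits zero: ordinary four-byte header */
    HDR_4B_COMPRESSED,  /* low bit clear, adjacent bit set: compressed in-line */
    HDR_1B_SHORT,       /* low bit set: single-byte header */
    HDR_1B_POINTER      /* single-byte header, length bits all zero: TOAST pointer */
} HeaderKind;

/*
 * Classify a datum from its first byte, ASSUMING the little-endian
 * layout in which the flag bits are the two low-order bits.
 */
static HeaderKind
classify_header(const uint8_t *datum)
{
    uint8_t first = datum[0];

    if (first & 0x01)
        return ((first >> 1) == 0) ? HDR_1B_POINTER : HDR_1B_SHORT;
    return (first & 0x02) ? HDR_4B_COMPRESSED : HDR_4B_PLAIN;
}

/*
 * Total size in bytes recorded in the header (header included).
 * Returns 0 for a TOAST pointer, whose size is determined by the
 * code in the second byte instead.
 */
static uint32_t
header_size_field(const uint8_t *datum)
{
    uint32_t word;

    assert(datum != NULL);
    switch (classify_header(datum)) {
    case HDR_1B_POINTER:
        return 0;
    case HDR_1B_SHORT:
        return (uint32_t) (datum[0] >> 1);      /* remaining 7 bits */
    default:
        /* Assemble the little-endian length word byte by byte. */
        word = (uint32_t) datum[0]
             | ((uint32_t) datum[1] << 8)
             | ((uint32_t) datum[2] << 16)
             | ((uint32_t) datum[3] << 24);
        return word >> 2;                       /* remaining 30 bits */
    }
}
```

   Under these assumptions, a first byte of 0x0B is classified as a
   single-byte header with a total size of 5 bytes, and the 30-bit size
   field of a four-byte header is what caps a value at 2^30 - 1 bytes.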

   The compression technique used for either in-line or out-of-line
   compressed data can be selected for each column by setting the
   COMPRESSION column option in CREATE TABLE or ALTER TABLE. The default
   for columns with no explicit setting is to consult the
   default_toast_compression parameter at the time data is inserted.

   As mentioned, there are multiple types of TOAST pointer datums. The
   oldest and most common type is a pointer to out-of-line data stored in
   a TOAST table that is separate from, but associated with, the table
   containing the TOAST pointer datum itself. These on-disk pointer datums
   are created by the TOAST management code (in
   access/common/toast_internals.c) when a tuple to be stored on disk is
   too large to be stored as-is. Further details appear in Section 66.2.1.
   Alternatively, a TOAST pointer datum can contain a pointer to
   out-of-line data that appears elsewhere in memory. Such datums are
   necessarily short-lived, and will never appear on-disk, but they are
   very useful for avoiding copying and redundant processing of large data
   values. Further details appear in Section 66.2.2.

66.2.1. Out-of-Line, On-Disk TOAST Storage

   If any of the columns of a table are TOAST-able, the table will have an
   associated TOAST table, whose OID is stored in the table's
   pg_class.reltoastrelid entry. On-disk TOASTed values are kept in the
   TOAST table, as described in more detail below.

   Out-of-line values are divided (after compression if used) into chunks
   of at most TOAST_MAX_CHUNK_SIZE bytes (by default this value is chosen
   so that four chunk rows will fit on a page, making it about 2000
   bytes). Each chunk is stored as a separate row in the TOAST table
   belonging to the owning table. Every TOAST table has the columns
   chunk_id (an OID identifying the particular TOASTed value), chunk_seq
   (a sequence number for the chunk within its value), and chunk_data (the
   actual data of the chunk). A unique index on chunk_id and chunk_seq
   provides fast retrieval of the values. A pointer datum representing an
   out-of-line on-disk TOASTed value therefore needs to store the OID of
   the TOAST table in which to look and the OID of the specific value (its
   chunk_id). For convenience, pointer datums also store the logical datum
   size (original uncompressed data length), physical stored size
   (different if compression was applied), and the compression method
   used, if any. Allowing for the varlena header bytes, the total size of
   an on-disk TOAST pointer datum is therefore 18 bytes regardless of the
   actual size of the represented value.
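
   The chunking arithmetic and the 18-byte pointer size can be checked
   with a short C sketch. Everything named here is hypothetical: the
   chunk size of 1996 bytes is an assumed stand-in for "about 2000
   bytes", and the struct packs the stored size and compression method
   into a single 32-bit word, which is an assumption made so that the
   fields listed above add up to the stated total.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED value for illustration; the text says only "about 2000 bytes". */
#define SKETCH_MAX_CHUNK_SIZE 1996u

/* Number of chunk rows needed for a value's stored (post-compression) size. */
static uint32_t
chunk_count(uint32_t stored_size, uint32_t max_chunk_size)
{
    assert(max_chunk_size > 0);
    /* Ceiling division, written to avoid overflow near UINT32_MAX. */
    return stored_size / max_chunk_size
         + (stored_size % max_chunk_size != 0 ? 1u : 0u);
}

/*
 * Payload of an on-disk TOAST pointer as enumerated in the text.
 * Field names and the exact packing are hypothetical.
 */
typedef struct {
    int32_t  raw_size;       /* logical (uncompressed) datum size */
    uint32_t ext_info;       /* stored size plus compression method (assumed packing) */
    uint32_t value_oid;      /* OID of the value, i.e. its chunk_id */
    uint32_t toast_rel_oid;  /* OID of the TOAST table to look in */
} SketchToastPointer;

/* One single-byte varlena header plus the type-code byte that follows it. */
enum { SKETCH_POINTER_HEADER_BYTES = 2 };
```

   With these assumptions, a 10000-byte stored value needs six chunk
   rows (five full chunks plus a 20-byte remainder), and the four 32-bit
   fields plus the two header bytes come to 18 bytes.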

   The TOAST management code is triggered only when a row value to be
   stored in a table is wider than TOAST_TUPLE_THRESHOLD bytes (normally 2
   kB). The TOAST code will compress and/or move field values out-of-line
   until the row value is shorter than TOAST_TUPLE_TARGET bytes (also
   normally 2 kB, adjustable) or no more gains can be had. During an
   UPDATE operation, values of unchanged fields are normally preserved
   as-is; so an UPDATE of a row with out-of-line values incurs no TOAST
   costs if none of the out-of-line values change.

   The TOAST management code recognizes four different strategies for
   storing TOAST-able columns on disk:
     * PLAIN prevents either compression or out-of-line storage. This is
       the only possible strategy for columns of non-TOAST-able data
       types.
     * EXTENDED allows both compression and out-of-line storage. This is
       the default for most TOAST-able data types. Compression will be
       attempted first, then out-of-line storage if the row is still too
       big.
     * EXTERNAL allows out-of-line storage but not compression. Use of
       EXTERNAL will make substring operations on wide text and bytea
       columns faster (at the penalty of increased storage space) because
       these operations are optimized to fetch only the required parts of
       the out-of-line value when it is not compressed.
     * MAIN allows compression but not out-of-line storage. (Actually,
       out-of-line storage will still be performed for such columns, but
       only as a last resort when there is no other way to make the row
       small enough to fit on a page.)
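
   The four strategies in the list above can be summarized as a pair of
   lookup functions. This is only a restatement of the list in C; the
   enum and function names are hypothetical and do not correspond to
   PostgreSQL's internal representation of storage strategies.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for this sketch. */
typedef enum {
    STORAGE_PLAIN,
    STORAGE_EXTENDED,
    STORAGE_EXTERNAL,
    STORAGE_MAIN
} StorageStrategy;

typedef enum {
    OOL_NEVER,        /* never moved out of line */
    OOL_LAST_RESORT,  /* moved only if the row cannot otherwise fit on a page */
    OOL_ALLOWED       /* moved whenever the row is still too big */
} OutOfLinePolicy;

/* EXTENDED and MAIN permit compression; PLAIN and EXTERNAL do not. */
static bool
allows_compression(StorageStrategy s)
{
    return s == STORAGE_EXTENDED || s == STORAGE_MAIN;
}

/* How willing each strategy is to move a value into the TOAST table. */
static OutOfLinePolicy
out_of_line_policy(StorageStrategy s)
{
    switch (s) {
    case STORAGE_EXTENDED:
    case STORAGE_EXTERNAL:
        return OOL_ALLOWED;
    case STORAGE_MAIN:
        return OOL_LAST_RESORT;
    case STORAGE_PLAIN:
        return OOL_NEVER;
    }
    assert(!"unknown storage strategy");
    return OOL_NEVER;
}
```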

   Each TOAST-able data type specifies a default strategy for columns of
   that data type, but the strategy for a given table column can be
   altered with ALTER TABLE ... SET STORAGE.

   TOAST_TUPLE_TARGET can be adjusted for each table using ALTER TABLE ...
   SET (toast_tuple_target = N).

   This scheme has a number of advantages compared to a more
   straightforward approach such as allowing row values to span pages.
   Assuming that queries are usually qualified by comparisons against
   relatively small key values, most of the work of the executor will be
   done using the main row entry. The big values of TOASTed attributes
   will only be pulled out (if selected at all) at the time the result set
   is sent to the client. Thus, the main table is much smaller and more of
   its rows fit in the shared buffer cache than would be the case without
   any out-of-line storage. Sort sets also shrink, and sorts will more
   often be done entirely in memory. A small test showed that a table
   containing typical HTML pages and their URLs was stored in about half
   of the raw data size including the TOAST table, and that the main table
   contained only about 10% of the entire data (the URLs and some small
   HTML pages). There was no run-time difference compared to an un-TOASTed
   comparison table, in which all the HTML pages were cut down to 7 kB to
   fit.

66.2.2. Out-of-Line, In-Memory TOAST Storage

   TOAST pointers can point to data that is not on disk, but is elsewhere
   in the memory of the current server process. Such pointers obviously
   cannot be long-lived, but they are nonetheless useful. There are
   currently two sub-cases: pointers to indirect data and pointers to
   expanded data.

   Indirect TOAST pointers simply point at a non-indirect varlena value
   stored somewhere in memory. This case was originally created merely as
   a proof of concept, but it is currently used during logical decoding to
   avoid possibly having to create physical tuples exceeding 1 GB (as
   pulling all out-of-line field values into the tuple might do). The case
   is of limited use since the creator of the pointer datum is entirely
   responsible for ensuring that the referenced data survives for as long
   as the pointer could exist, and there is no infrastructure to help with
   this.

   Expanded TOAST pointers are useful for complex data types whose on-disk
   representation is not especially suited for computational purposes. As
   an example, the standard varlena representation of a PostgreSQL array
   includes dimensionality information, a nulls bitmap if there are any
   null elements, then the values of all the elements in order. When the
   element type itself is variable-length, the only way to find the N'th
   element is to scan through all the preceding elements. This
   representation is appropriate for on-disk storage because of its
   compactness, but for computations with the array it's much nicer to
   have an “expanded” or “deconstructed” representation in which all the
   element starting locations have been identified. The TOAST pointer
   mechanism supports this need by allowing a pass-by-reference Datum to
   point to either a standard varlena value (the on-disk representation)
   or a TOAST pointer that points to an expanded representation somewhere
   in memory. The details of this expanded representation are up to the
   data type, though it must have a standard header and meet the other API
   requirements given in src/include/utils/expandeddatum.h. C-level
   functions working with the data type can choose to handle either
   representation. Functions that do not know about the expanded
   representation, but simply apply PG_DETOAST_DATUM to their inputs, will
   automatically receive the traditional varlena representation; so
   support for an expanded representation can be introduced incrementally,
   one function at a time.
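
   The difference between scanning a packed representation and indexing
   an expanded one can be seen in a toy model. The sketch below is not
   PostgreSQL's array format or its expanded-datum API: it assumes a
   made-up packed layout in which each element is one length byte
   followed by that many payload bytes, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Toy packed form (an assumption of this sketch): each element is one
 * length byte followed by that many payload bytes.
 */

/* Find element n by stepping over the n elements before it: O(n). */
static const uint8_t *
packed_nth(const uint8_t *buf, size_t nelems, size_t n)
{
    const uint8_t *p = buf;

    assert(n < nelems);
    for (size_t i = 0; i < n; i++)
        p += 1 + (size_t) p[0];
    return p;
}

/* Toy expanded form: the packed bytes plus every element's start offset. */
typedef struct {
    const uint8_t *buf;
    size_t         nelems;
    size_t        *offsets;
} ExpandedSketch;

/* One pass records all offsets. Returns 0 on success, -1 if out of memory. */
static int
expand(ExpandedSketch *out, const uint8_t *buf, size_t nelems)
{
    size_t off = 0;

    out->offsets = malloc((nelems ? nelems : 1) * sizeof *out->offsets);
    if (out->offsets == NULL)
        return -1;
    for (size_t i = 0; i < nelems; i++) {
        out->offsets[i] = off;
        off += 1 + (size_t) buf[off];
    }
    out->buf = buf;
    out->nelems = nelems;
    return 0;
}

/* After expansion, any element is found in constant time. */
static const uint8_t *
expanded_nth(const ExpandedSketch *e, size_t n)
{
    assert(n < e->nelems);
    return e->buf + e->offsets[n];
}
```

   The packed form stays compact, while the expanded form pays for one
   scan and an offsets array so that later lookups need no scanning.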

   TOAST pointers to expanded values are further broken down into
   read-write and read-only pointers. The pointed-to representation is the
   same either way, but a function that receives a read-write pointer is
   allowed to modify the referenced value in-place, whereas one that
   receives a read-only pointer must not; it must first create a copy if
   it wants to make a modified version of the value. This distinction and
   some associated conventions make it possible to avoid unnecessary
   copying of expanded values during query execution.
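
   The read-write versus read-only convention amounts to copy-on-write
   decided by the kind of pointer received. The following sketch models
   that rule with hypothetical types (ExpandedValue, ExpandedRef) and a
   hypothetical helper, writable_target; it does not use or imitate the
   actual expanded-datum API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an expanded in-memory value. */
typedef struct {
    size_t len;
    int   *items;
} ExpandedValue;

/* Hypothetical stand-in for a read-write or read-only pointer to it. */
typedef struct {
    ExpandedValue *value;
    bool           read_write;
} ExpandedRef;

/* Deep-copy a value; returns NULL if allocation fails. */
static ExpandedValue *
copy_value(const ExpandedValue *src)
{
    ExpandedValue *dst = malloc(sizeof *dst);

    if (dst == NULL)
        return NULL;
    dst->len = src->len;
    dst->items = malloc((src->len ? src->len : 1) * sizeof *dst->items);
    if (dst->items == NULL) {
        free(dst);
        return NULL;
    }
    if (src->len > 0)
        memcpy(dst->items, src->items, src->len * sizeof *dst->items);
    return dst;
}

/*
 * Return a value the caller may modify: the referenced value itself for
 * a read-write reference, or a fresh copy for a read-only one.
 */
static ExpandedValue *
writable_target(ExpandedRef ref)
{
    assert(ref.value != NULL);
    return ref.read_write ? ref.value : copy_value(ref.value);
}
```

   A caller holding a read-write reference thus modifies the original
   with no copy, while a caller holding a read-only reference works on
   its own copy (which it then owns) and leaves the original untouched.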

   For all types of in-memory TOAST pointer, the TOAST management code
   ensures that no such pointer datum can accidentally get stored on disk.
   In-memory TOAST pointers are automatically expanded to normal in-line
   varlena values before storage, and then possibly converted to on-disk
   TOAST pointers if the containing tuple would otherwise be too big.