begriffs open source - ai-pg/blob - full-docs/txt/gin.txt

   1
   2 65.4. GIN Indexes #
   3
   4    65.4.1. Introduction
   5    65.4.2. Built-in Operator Classes
   6    65.4.3. Extensibility
   7    65.4.4. Implementation
   8    65.4.5. GIN Tips and Tricks
   9    65.4.6. Limitations
  10    65.4.7. Examples
  11
  12 65.4.1. Introduction #
  13
  14    GIN stands for Generalized Inverted Index. GIN is designed for handling
  15    cases where the items to be indexed are composite values, and the
  16    queries to be handled by the index need to search for element values
  17    that appear within the composite items. For example, the items could be
  18    documents, and the queries could be searches for documents containing
  19    specific words.
  20
  21    We use the word item to refer to a composite value that is to be
  22    indexed, and the word key to refer to an element value. GIN always
  23    stores and searches for keys, not item values per se.
  24
  25    A GIN index stores a set of (key, posting list) pairs, where a posting
  26    list is a set of row IDs in which the key occurs. The same row ID can
  27    appear in multiple posting lists, since an item can contain more than
  28    one key. Each key value is stored only once, so a GIN index is very
  29    compact for cases where the same key appears many times.
  30
  31    GIN is generalized in the sense that the GIN access method code does
  32    not need to know the specific operations that it accelerates. Instead,
  33    it uses custom strategies defined for particular data types. The
  34    strategy defines how keys are extracted from indexed items and query
  35    conditions, and how to determine whether a row that contains some of
  36    the key values in a query actually satisfies the query.
  37
  38    One advantage of GIN is that it allows the development of custom data
  39    types with the appropriate access methods, by an expert in the domain
  40    of the data type, rather than a database expert. This is much the same
  41    advantage as using GiST.
  42
  43    The GIN implementation in PostgreSQL is primarily maintained by Teodor
  44    Sigaev and Oleg Bartunov. There is more information about GIN on their
  45    website.
  46
  47 65.4.2. Built-in Operator Classes #
  48
  49    The core PostgreSQL distribution includes the GIN operator classes
  50    shown in Table 65.3. (Some of the optional modules described in
  51    Appendix F provide additional GIN operator classes.)
  52
  53    Table 65.3. Built-in GIN Operator Classes
  54         Name       Indexable Operators
  55    array_ops      && (anyarray,anyarray)
  56                   @> (anyarray,anyarray)
  57                   <@ (anyarray,anyarray)
  58                   = (anyarray,anyarray)
  59    jsonb_ops      @> (jsonb,jsonb)
  60                   @? (jsonb,jsonpath)
  61                   @@ (jsonb,jsonpath)
  62                   ? (jsonb,text)
  63                   ?| (jsonb,text[])
  64                   ?& (jsonb,text[])
  65    jsonb_path_ops @> (jsonb,jsonb)
  66                   @? (jsonb,jsonpath)
  67                   @@ (jsonb,jsonpath)
  68    tsvector_ops   @@ (tsvector,tsquery)
  69
  70    Of the two operator classes for type jsonb, jsonb_ops is the default.
  71    jsonb_path_ops supports fewer operators but offers better performance
  72    for those operators. See Section 8.14.4 for details.
  73
  74 65.4.3. Extensibility #
  75
  76    The GIN interface has a high level of abstraction, requiring the access
  77    method implementer only to implement the semantics of the data type
  78    being accessed. The GIN layer itself takes care of concurrency, logging
  79    and searching the tree structure.
  80
  81    All it takes to get a GIN access method working is to implement a few
  82    user-defined methods, which define the behavior of keys in the tree and
  83    the relationships between keys, indexed items, and indexable queries.
  84    In short, GIN combines extensibility with generality, code reuse, and a
  85    clean interface.
  86
  87    There are two methods that an operator class for GIN must provide:
  88
  89    Datum *extractValue(Datum itemValue, int32 *nkeys, bool **nullFlags)
  90           Returns a palloc'd array of keys given an item to be indexed.
  91           The number of returned keys must be stored into *nkeys. If any
  92           of the keys can be null, also palloc an array of *nkeys bool
  93           fields, store its address at *nullFlags, and set these null
  94           flags as needed. *nullFlags can be left NULL (its initial value)
  95           if all keys are non-null. The return value can be NULL if the
  96           item contains no keys.
  97
  98    Datum *extractQuery(Datum query, int32 *nkeys, StrategyNumber n, bool
  99           **pmatch, Pointer **extra_data, bool **nullFlags, int32
 100           *searchMode)
 101           Returns a palloc'd array of keys given a value to be queried;
 102           that is, query is the value on the right-hand side of an
 103           indexable operator whose left-hand side is the indexed column. n
 104           is the strategy number of the operator within the operator class
 105           (see Section 36.16.2). Often, extractQuery will need to consult
 106           n to determine the data type of query and the method it should
 107           use to extract key values. The number of returned keys must be
 108           stored into *nkeys. If any of the keys can be null, also palloc
 109           an array of *nkeys bool fields, store its address at *nullFlags,
 110           and set these null flags as needed. *nullFlags can be left NULL
 111           (its initial value) if all keys are non-null. The return value
 112           can be NULL if the query contains no keys.
 113
 114           searchMode is an output argument that allows extractQuery to
 115           specify details about how the search will be done. If
 116           *searchMode is set to GIN_SEARCH_MODE_DEFAULT (which is the
 117           value it is initialized to before call), only items that match
 118           at least one of the returned keys are considered candidate
 119           matches. If *searchMode is set to GIN_SEARCH_MODE_INCLUDE_EMPTY,
 120           then in addition to items containing at least one matching key,
 121           items that contain no keys at all are considered candidate
 122           matches. (This mode is useful for implementing is-subset-of
 123           operators, for example.) If *searchMode is set to
 124           GIN_SEARCH_MODE_ALL, then all non-null items in the index are
 125           considered candidate matches, whether they match any of the
 126           returned keys or not. (This mode is much slower than the other
 127           two choices, since it requires scanning essentially the entire
 128           index, but it may be necessary to implement corner cases
 129           correctly. An operator that needs this mode in most cases is
 130           probably not a good candidate for a GIN operator class.) The
 131           symbols to use for setting this mode are defined in
 132           access/gin.h.
 133
 134           pmatch is an output argument for use when partial match is
 135           supported. To use it, extractQuery must allocate an array of
 136           *nkeys bools and store its address at *pmatch. Each element of
 137           the array should be set to true if the corresponding key
 138           requires partial match, false if not. If *pmatch is set to NULL
 139           then GIN assumes partial match is not required. The variable is
 140           initialized to NULL before call, so this argument can simply be
 141           ignored by operator classes that do not support partial match.
 142
 143           extra_data is an output argument that allows extractQuery to
 144           pass additional data to the consistent and comparePartial
 145           methods. To use it, extractQuery must allocate an array of
 146           *nkeys pointers and store its address at *extra_data, then store
 147           whatever it wants to into the individual pointers. The variable
 148           is initialized to NULL before call, so this argument can simply
 149           be ignored by operator classes that do not require extra data.
 150           If *extra_data is set, the whole array is passed to the
 151           consistent method, and the appropriate element to the
 152           comparePartial method.
 153
 154    An operator class must also provide a function to check if an indexed
 155    item matches the query. It comes in two flavors, a Boolean consistent
 156    function, and a ternary triConsistent function. triConsistent covers
 157    the functionality of both, so providing triConsistent alone is
 158    sufficient. However, if the Boolean variant is significantly cheaper to
 159    calculate, it can be advantageous to provide both. If only the Boolean
 160    variant is provided, some optimizations that depend on refuting index
 161    items before fetching all the keys are disabled.
 162
 163    bool consistent(bool check[], StrategyNumber n, Datum query, int32
 164           nkeys, Pointer extra_data[], bool *recheck, Datum queryKeys[],
 165           bool nullFlags[])
 166           Returns true if an indexed item satisfies the query operator
 167           with strategy number n (or might satisfy it, if the recheck
 168           indication is returned). This function does not have direct
 169           access to the indexed item's value, since GIN does not store
 170           items explicitly. Rather, what is available is knowledge about
 171           which key values extracted from the query appear in a given
 172           indexed item. The check array has length nkeys, which is the
 173           same as the number of keys previously returned by extractQuery
 174           for this query datum. Each element of the check array is true if
 175           the indexed item contains the corresponding query key, i.e., if
 176           (check[i] == true) the i-th key of the extractQuery result array
 177           is present in the indexed item. The original query datum is
 178           passed in case the consistent method needs to consult it, and so
 179           are the queryKeys[] and nullFlags[] arrays previously returned
 180           by extractQuery. extra_data is the extra-data array returned by
 181           extractQuery, or NULL if none.
 182
 183           When extractQuery returns a null key in queryKeys[], the
 184           corresponding check[] element is true if the indexed item
 185           contains a null key; that is, the semantics of check[] are like
 186           IS NOT DISTINCT FROM. The consistent function can examine the
 187           corresponding nullFlags[] element if it needs to tell the
 188           difference between a regular value match and a null match.
 189
 190           On success, *recheck should be set to true if the heap tuple
 191           needs to be rechecked against the query operator, or false if
 192           the index test is exact. That is, a false return value
 193           guarantees that the heap tuple does not match the query; a true
 194           return value with *recheck set to false guarantees that the heap
 195           tuple does match the query; and a true return value with
 196           *recheck set to true means that the heap tuple might match the
 197           query, so it needs to be fetched and rechecked by evaluating the
 198           query operator directly against the originally indexed item.
 199
 200    GinTernaryValue triConsistent(GinTernaryValue check[], StrategyNumber
 201           n, Datum query, int32 nkeys, Pointer extra_data[], Datum
 202           queryKeys[], bool nullFlags[])
 203           triConsistent is similar to consistent, but instead of Booleans
 204           in the check vector, there are three possible values for each
 205           key: GIN_TRUE, GIN_FALSE and GIN_MAYBE. GIN_FALSE and GIN_TRUE
 206           have the same meaning as regular Boolean values, while GIN_MAYBE
 207           means that the presence of that key is not known. When GIN_MAYBE
 208           values are present, the function should only return GIN_TRUE if
 209           the item certainly matches whether or not the index item
 210           contains the corresponding query keys. Likewise, the function
 211           must return GIN_FALSE only if the item certainly does not match,
 212           whether or not it contains the GIN_MAYBE keys. If the result
 213           depends on the GIN_MAYBE entries, i.e., the match cannot be
 214           confirmed or refuted based on the known query keys, the function
 215           must return GIN_MAYBE.
 216
 217           When there are no GIN_MAYBE values in the check vector, a
 218           GIN_MAYBE return value is the equivalent of setting the recheck
 219           flag in the Boolean consistent function.
 220
 221    In addition, GIN must have a way to sort the key values stored in the
 222    index. The operator class can define the sort ordering by specifying a
 223    comparison method:
 224
 225    int compare(Datum a, Datum b)
 226           Compares two keys (not indexed items!) and returns an integer
 227           less than zero, zero, or greater than zero, indicating whether
 228           the first key is less than, equal to, or greater than the
 229           second. Null keys are never passed to this function.
 230
 231    Alternatively, if the operator class does not provide a compare method,
 232    GIN will look up the default btree operator class for the index key
 233    data type, and use its comparison function. It is recommended to
 234    specify the comparison function in a GIN operator class that is meant
 235    for just one data type, as looking up the btree operator class costs a
 236    few cycles. However, polymorphic GIN operator classes (such as
 237    array_ops) typically cannot specify a single comparison function.
 238
 239    An operator class for GIN can optionally supply the following methods:
 240
 241    int comparePartial(Datum partial_key, Datum key, StrategyNumber n,
 242           Pointer extra_data)
 243           Compare a partial-match query key to an index key. Returns an
 244           integer whose sign indicates the result: less than zero means
 245           the index key does not match the query, but the index scan
 246           should continue; zero means that the index key does match the
 247           query; greater than zero indicates that the index scan should
 248           stop because no more matches are possible. The strategy number n
 249           of the operator that generated the partial match query is
 250           provided, in case its semantics are needed to determine when to
 251           end the scan. Also, extra_data is the corresponding element of
 252           the extra-data array made by extractQuery, or NULL if none. Null
 253           keys are never passed to this function.
 254
 255    void options(local_relopts *relopts)
 256           Defines a set of user-visible parameters that control operator
 257           class behavior.
 258
 259           The options function is passed a pointer to a local_relopts
 260           struct, which needs to be filled with a set of operator class
 261           specific options. The options can be accessed from other support
 262           functions using the PG_HAS_OPCLASS_OPTIONS() and
 263           PG_GET_OPCLASS_OPTIONS() macros.
 264
 265           Since both key extraction of indexed values and representation
 266           of the key in GIN are flexible, they may depend on
 267           user-specified parameters.
 268
 269    To support “partial match” queries, an operator class must provide the
 270    comparePartial method, and its extractQuery method must set the pmatch
 271    parameter when a partial-match query is encountered. See
 272    Section 65.4.4.2 for details.
 273
 274    The actual data types of the various Datum values mentioned above vary
 275    depending on the operator class. The item values passed to extractValue
 276    are always of the operator class's input type, and all key values must
 277    be of the class's STORAGE type. The type of the query argument passed
 278    to extractQuery, consistent and triConsistent is whatever is the
 279    right-hand input type of the class member operator identified by the
 280    strategy number. This need not be the same as the indexed type, so long
 281    as key values of the correct type can be extracted from it. However, it
 282    is recommended that the SQL declarations of these three support
 283    functions use the opclass's indexed data type for the query argument,
 284    even though the actual type might be something else depending on the
 285    operator.
 286
 287 65.4.4. Implementation #
 288
 289    Internally, a GIN index contains a B-tree index constructed over keys,
 290    where each key is an element of one or more indexed items (a member of
 291    an array, for example) and where each tuple in a leaf page contains
 292    either a pointer to a B-tree of heap pointers (a “posting tree”), or a
 293    simple list of heap pointers (a “posting list”) when the list is small
 294    enough to fit into a single index tuple along with the key value.
 295    Figure 65.1 illustrates these components of a GIN index.
 296
 297    As of PostgreSQL 9.1, null key values can be included in the index.
 298    Also, placeholder nulls are included in the index for indexed items
 299    that are null or contain no keys according to extractValue. This allows
 300    searches that should find empty items to do so.
 301
 302    Multicolumn GIN indexes are implemented by building a single B-tree
 303    over composite values (column number, key value). The key values for
 304    different columns can be of different types.
 305
 306    Figure 65.1. GIN Internals
 307
 308 65.4.4.1. GIN Fast Update Technique #
 309
 310    Updating a GIN index tends to be slow because of the intrinsic nature
 311    of inverted indexes: inserting or updating one heap row can cause many
 312    inserts into the index (one for each key extracted from the indexed
 313    item). GIN is capable of postponing much of this work by inserting new
 314    tuples into a temporary, unsorted list of pending entries. When the
 315    table is vacuumed or autoanalyzed, or when gin_clean_pending_list
 316    function is called, or if the pending list becomes larger than
 317    gin_pending_list_limit, the entries are moved to the main GIN data
 318    structure using the same bulk insert techniques used during initial
 319    index creation. This greatly improves GIN index update speed, even
 320    counting the additional vacuum overhead. Moreover the overhead work can
 321    be done by a background process instead of in foreground query
 322    processing.
 323
 324    The main disadvantage of this approach is that searches must scan the
 325    list of pending entries in addition to searching the regular index, and
 326    so a large list of pending entries will slow searches significantly.
 327    Another disadvantage is that, while most updates are fast, an update
 328    that causes the pending list to become “too large” will incur an
 329    immediate cleanup cycle and thus be much slower than other updates.
 330    Proper use of autovacuum can minimize both of these problems.
 331
 332    If consistent response time is more important than update speed, use of
 333    pending entries can be disabled by turning off the fastupdate storage
 334    parameter for a GIN index. See CREATE INDEX for details.
 335
 336 65.4.4.2. Partial Match Algorithm #
 337
 338    GIN can support “partial match” queries, in which the query does not
 339    determine an exact match for one or more keys, but the possible matches
 340    fall within a reasonably narrow range of key values (within the key
 341    sorting order determined by the compare support method). The
 342    extractQuery method, instead of returning a key value to be matched
 343    exactly, returns a key value that is the lower bound of the range to be
 344    searched, and sets the pmatch flag true. The key range is then scanned
 345    using the comparePartial method. comparePartial must return zero for a
 346    matching index key, less than zero for a non-match that is still within
 347    the range to be searched, or greater than zero if the index key is past
 348    the range that could match.
 349
 350 65.4.5. GIN Tips and Tricks #
 351
 352    Create vs. insert
 353           Insertion into a GIN index can be slow due to the likelihood of
 354           many keys being inserted for each item. So, for bulk insertions
 355           into a table it is advisable to drop the GIN index and recreate
 356           it after finishing bulk insertion.
 357
 358           When fastupdate is enabled for GIN (see Section 65.4.4.1 for
 359           details), the penalty is less than when it is not. But for very
 360           large updates it may still be best to drop and recreate the
 361           index.
 362
 363    maintenance_work_mem
 364           Build time for a GIN index is very sensitive to the
 365           maintenance_work_mem setting; it doesn't pay to skimp on work
 366           memory during index creation.
 367
 368    gin_pending_list_limit
 369           During a series of insertions into an existing GIN index that
 370           has fastupdate enabled, the system will clean up the
 371           pending-entry list whenever the list grows larger than
 372           gin_pending_list_limit. To avoid fluctuations in observed
 373           response time, it's desirable to have pending-list cleanup occur
 374           in the background (i.e., via autovacuum). Foreground cleanup
 375           operations can be avoided by increasing gin_pending_list_limit
 376           or making autovacuum more aggressive. However, enlarging the
 377           threshold of the cleanup operation means that if a foreground
 378           cleanup does occur, it will take even longer.
 379
 380           gin_pending_list_limit can be overridden for individual GIN
 381           indexes by changing storage parameters, which allows each GIN
 382           index to have its own cleanup threshold. For example, it's
 383           possible to increase the threshold only for the GIN index which
 384           can be updated heavily, and decrease it otherwise.
 385
 386    gin_fuzzy_search_limit
 387           The primary goal of developing GIN indexes was to create support
 388           for highly scalable full-text search in PostgreSQL, and there
 389           are often situations when a full-text search returns a very
 390           large set of results. Moreover, this often happens when the
 391           query contains very frequent words, so that the large result set
 392           is not even useful. Since reading many tuples from the disk and
 393           sorting them could take a lot of time, this is unacceptable for
 394           production. (Note that the index search itself is very fast.)
 395
 396           To facilitate controlled execution of such queries, GIN has a
 397           configurable soft upper limit on the number of rows returned:
 398           the gin_fuzzy_search_limit configuration parameter. It is set to
 399           0 (meaning no limit) by default. If a non-zero limit is set,
 400           then the returned set is a subset of the whole result set,
 401           chosen at random.
 402
 403           “Soft” means that the actual number of returned results could
 404           differ somewhat from the specified limit, depending on the query
 405           and the quality of the system's random number generator.
 406
 407           From experience, values in the thousands (e.g., 5000 — 20000)
 408           work well.
 409
 410 65.4.6. Limitations #
 411
 412    GIN assumes that indexable operators are strict. This means that
 413    extractValue will not be called at all on a null item value (instead, a
 414    placeholder index entry is created automatically), and extractQuery
 415    will not be called on a null query value either (instead, the query is
 416    presumed to be unsatisfiable). Note however that null key values
 417    contained within a non-null composite item or query value are
 418    supported.
 419
 420 65.4.7. Examples #
 421
 422    The core PostgreSQL distribution includes the GIN operator classes
 423    previously shown in Table 65.3. The following contrib modules also
 424    contain GIN operator classes:
 425
 426    btree_gin
 427           B-tree equivalent functionality for several data types
 428
 429    hstore
 430           Module for storing (key, value) pairs
 431
 432    intarray
 433           Enhanced support for int[]
 434
 435    pg_trgm
 436           Text similarity using trigram matching