begriffs open source - ai-pg/blob - full-docs/txt/bloom.txt

   1
   2 F.6. bloom — bloom filter index access method #
   3
   4    F.6.1. Parameters
   5    F.6.2. Examples
   6    F.6.3. Operator Class Interface
   7    F.6.4. Limitations
   8    F.6.5. Authors
   9
  10    bloom provides an index access method based on Bloom filters.
  11
  12    A Bloom filter is a space-efficient data structure that is used to test
  13    whether an element is a member of a set. In the case of an index access
  14    method, it allows fast exclusion of non-matching tuples via signatures
  15    whose size is determined at index creation.
  16
  17    A signature is a lossy representation of the indexed attribute(s), and
  18    as such is prone to reporting false positives; that is, it may be
  19    reported that an element is in the set, when it is not. So index search
  20    results must always be rechecked using the actual attribute values from
  21    the heap entry. Larger signatures reduce the odds of a false positive
  22    and thus reduce the number of useless heap visits, but of course also
  23    make the index larger and hence slower to scan.
  24
  25    This type of index is most useful when a table has many attributes and
  26    queries test arbitrary combinations of them. A traditional btree index
  27    is faster than a bloom index, but it can require many btree indexes to
  28    support all possible queries where one needs only a single bloom index.
  29    Note however that bloom indexes only support equality queries, whereas
  30    btree indexes can also perform inequality and range searches.
  31
  32 F.6.1. Parameters #
  33
  34    A bloom index accepts the following parameters in its WITH clause:
  35
  36    length
  37           Length of each signature (index entry) in bits. It is rounded up
  38           to the nearest multiple of 16. The default is 80 bits and the
  39           maximum is 4096.
  40
  41    col1 — col32
  42           Number of bits generated for each index column. Each parameter's
  43           name refers to the number of the index column that it controls.
  44           The default is 2 bits and the maximum is 4095. Parameters for
  45           index columns not actually used are ignored.
  46
  47 F.6.2. Examples #
  48
  49    This is an example of creating a bloom index:
  50 CREATE INDEX bloomidx ON tbloom USING bloom (i1,i2,i3)
  51        WITH (length=80, col1=2, col2=2, col3=4);
  52
  53    The index is created with a signature length of 80 bits, with
  54    attributes i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4
  55    bits. We could have omitted the length, col1, and col2 specifications
  56    since those have the default values.
  57
  58    Here is a more complete example of bloom index definition and usage, as
  59    well as a comparison with equivalent btree indexes. The bloom index is
  60    considerably smaller than the btree index, and can perform better.
  61 =# CREATE TABLE tbloom AS
  62    SELECT
  63      (random() * 1000000)::int as i1,
  64      (random() * 1000000)::int as i2,
  65      (random() * 1000000)::int as i3,
  66      (random() * 1000000)::int as i4,
  67      (random() * 1000000)::int as i5,
  68      (random() * 1000000)::int as i6
  69    FROM
  70   generate_series(1,10000000);
  71 SELECT 10000000
  72
  73    A sequential scan over this large table takes a long time:
  74 =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
  75                                               QUERY PLAN
  76 -------------------------------------------------------------------------------
  77 -----------------------
  78  Seq Scan on tbloom  (cost=0.00..213744.00 rows=250 width=24) (actual time=357.0
  79 59..357.059 rows=0.00 loops=1)
  80    Filter: ((i2 = 898732) AND (i5 = 123451))
  81    Rows Removed by Filter: 10000000
  82    Buffers: shared hit=63744
  83  Planning Time: 0.346 ms
  84  Execution Time: 357.076 ms
  85 (6 rows)
  86
  87    Even with the btree index defined the result will still be a sequential
  88    scan:
  89 =# CREATE INDEX btreeidx ON tbloom (i1, i2, i3, i4, i5, i6);
  90 CREATE INDEX
  91 =# SELECT pg_size_pretty(pg_relation_size('btreeidx'));
  92  pg_size_pretty
  93 ----------------
  94  386 MB
  95 (1 row)
  96 =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
  97                                               QUERY PLAN
  98 -------------------------------------------------------------------------------
  99 -----------------------
 100  Seq Scan on tbloom  (cost=0.00..213744.00 rows=2 width=24) (actual time=351.016
 101 ..351.017 rows=0.00 loops=1)
 102    Filter: ((i2 = 898732) AND (i5 = 123451))
 103    Rows Removed by Filter: 10000000
 104    Buffers: shared hit=63744
 105  Planning Time: 0.138 ms
 106  Execution Time: 351.035 ms
 107 (6 rows)
 108
 109    Having the bloom index defined on the table is better than btree in
 110    handling this type of search:
 111 =# CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3, i4, i5, i6);
 112 CREATE INDEX
 113 =# SELECT pg_size_pretty(pg_relation_size('bloomidx'));
 114  pg_size_pretty
 115 ----------------
 116  153 MB
 117 (1 row)
 118 =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
 119                                                      QUERY PLAN
 120 -------------------------------------------------------------------------------
 121 --------------------------------------
 122  Bitmap Heap Scan on tbloom  (cost=1792.00..1799.69 rows=2 width=24) (actual tim
 123 e=22.605..22.606 rows=0.00 loops=1)
 124    Recheck Cond: ((i2 = 898732) AND (i5 = 123451))
 125    Rows Removed by Index Recheck: 2300
 126    Heap Blocks: exact=2256
 127    Buffers: shared hit=21864
 128    ->  Bitmap Index Scan on bloomidx  (cost=0.00..178436.00 rows=1 width=0) (act
 129 ual time=20.005..20.005 rows=2300.00 loops=1)
 130          Index Cond: ((i2 = 898732) AND (i5 = 123451))
 131          Index Searches: 1
 132          Buffers: shared hit=19608
 133  Planning Time: 0.099 ms
 134  Execution Time: 22.632 ms
 135 (11 rows)
 136
 137    Now, the main problem with the btree search is that btree is
 138    inefficient when the search conditions do not constrain the leading
 139    index column(s). A better strategy for btree is to create a separate
 140    index on each column. Then the planner will choose something like this:
 141 =# CREATE INDEX btreeidx1 ON tbloom (i1);
 142 CREATE INDEX
 143 =# CREATE INDEX btreeidx2 ON tbloom (i2);
 144 CREATE INDEX
 145 =# CREATE INDEX btreeidx3 ON tbloom (i3);
 146 CREATE INDEX
 147 =# CREATE INDEX btreeidx4 ON tbloom (i4);
 148 CREATE INDEX
 149 =# CREATE INDEX btreeidx5 ON tbloom (i5);
 150 CREATE INDEX
 151 =# CREATE INDEX btreeidx6 ON tbloom (i6);
 152 CREATE INDEX
 153 =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
 154                                                         QUERY PLAN
 155 -------------------------------------------------------------------------------
 156 --------------------------------------------
 157  Bitmap Heap Scan on tbloom  (cost=9.29..13.30 rows=1 width=24) (actual time=0.0
 158 32..0.033 rows=0.00 loops=1)
 159    Recheck Cond: ((i5 = 123451) AND (i2 = 898732))
 160    Buffers: shared read=6
 161    ->  BitmapAnd  (cost=9.29..9.29 rows=1 width=0) (actual time=0.047..0.047 row
 162 s=0.00 loops=1)
 163          Buffers: shared hit=6
 164          ->  Bitmap Index Scan on btreeidx5  (cost=0.00..4.52 rows=11 width=0) (
 165 actual time=0.026..0.026 rows=7.00 loops=1)
 166                Index Cond: (i5 = 123451)
 167                Index Searches: 1
 168                Buffers: shared hit=3
 169          ->  Bitmap Index Scan on btreeidx2  (cost=0.00..4.52 rows=11 width=0) (
 170 actual time=0.007..0.007 rows=8.00 loops=1)
 171                Index Cond: (i2 = 898732)
 172                Index Searches: 1
 173                Buffers: shared hit=3
 174  Planning Time: 0.264 ms
 175  Execution Time: 0.047 ms
 176 (15 rows)
 177
 178    Although this query runs much faster than with either of the single
 179    indexes, we pay a penalty in index size. Each of the single-column
 180    btree indexes occupies 88.5 MB, so the total space needed is 531 MB,
 181    over three times the space used by the bloom index.
 182
 183 F.6.3. Operator Class Interface #
 184
 185    An operator class for bloom indexes requires only a hash function for
 186    the indexed data type and an equality operator for searching. This
 187    example shows the operator class definition for the text data type:
 188 CREATE OPERATOR CLASS text_ops
 189 DEFAULT FOR TYPE text USING bloom AS
 190     OPERATOR    1   =(text, text),
 191     FUNCTION    1   hashtext(text);
 192
 193 F.6.4. Limitations #
 194
 195      * Only operator classes for int4 and text are included with the
 196        module.
 197      * Only the = operator is supported for search. But it is possible to
 198        add support for arrays with union and intersection operations in
 199        the future.
 200      * bloom access method doesn't support UNIQUE indexes.
 201      * bloom access method doesn't support searching for NULL values.
 202
 203 F.6.5. Authors #
 204
 205    Teodor Sigaev <teodor@postgrespro.ru>, Postgres Professional, Moscow,
 206    Russia
 207
 208    Alexander Korotkov <a.korotkov@postgrespro.ru>, Postgres Professional,
 209    Moscow, Russia
 210
 211    Oleg Bartunov <obartunov@postgrespro.ru>, Postgres Professional,
 212    Moscow, Russia