2 F.6. bloom — bloom filter index access method #
6 F.6.3. Operator Class Interface
10 bloom provides an index access method based on Bloom filters.
12 A Bloom filter is a space-efficient data structure that is used to test
13 whether an element is a member of a set. In the case of an index access
14 method, it allows fast exclusion of non-matching tuples via signatures
15 whose size is determined at index creation.
17 A signature is a lossy representation of the indexed attribute(s), and
18 as such is prone to reporting false positives; that is, it may be
19 reported that an element is in the set, when it is not. So index search
20 results must always be rechecked using the actual attribute values from
21 the heap entry. Larger signatures reduce the odds of a false positive
22 and thus reduce the number of useless heap visits, but of course also
23 make the index larger and hence slower to scan.
25 This type of index is most useful when a table has many attributes and
26 queries test arbitrary combinations of them. A traditional btree index
27 is faster than a bloom index, but it can require many btree indexes to
28 support all possible queries where one needs only a single bloom index.
29 Note however that bloom indexes only support equality queries, whereas
30 btree indexes can also perform inequality and range searches.
34 A bloom index accepts the following parameters in its WITH clause:
37 Length of each signature (index entry) in bits. It is rounded up
38 to the nearest multiple of 16. The default is 80 bits and the
42 Number of bits generated for each index column. Each parameter's
43 name refers to the number of the index column that it controls.
44 The default is 2 bits and the maximum is 4095. Parameters for
45 index columns not actually used are ignored.
49 This is an example of creating a bloom index:
50 CREATE INDEX bloomidx ON tbloom USING bloom (i1,i2,i3)
51 WITH (length=80, col1=2, col2=2, col3=4);
53 The index is created with a signature length of 80 bits, with
54 attributes i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4
55 bits. We could have omitted the length, col1, and col2 specifications
56 since those have the default values.
58 Here is a more complete example of bloom index definition and usage, as
59 well as a comparison with equivalent btree indexes. The bloom index is
60 considerably smaller than the btree index, and can perform better.
61 =# CREATE TABLE tbloom AS
63 (random() * 1000000)::int as i1,
64 (random() * 1000000)::int as i2,
65 (random() * 1000000)::int as i3,
66 (random() * 1000000)::int as i4,
67 (random() * 1000000)::int as i5,
68 (random() * 1000000)::int as i6
70 generate_series(1,10000000);
73 A sequential scan over this large table takes a long time:
74 =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
76 -------------------------------------------------------------------------------
77 -----------------------
78 Seq Scan on tbloom (cost=0.00..213744.00 rows=250 width=24) (actual time=357.0
79 59..357.059 rows=0.00 loops=1)
80 Filter: ((i2 = 898732) AND (i5 = 123451))
81 Rows Removed by Filter: 10000000
82 Buffers: shared hit=63744
83 Planning Time: 0.346 ms
84 Execution Time: 357.076 ms
87 Even with the btree index defined the result will still be a sequential
89 =# CREATE INDEX btreeidx ON tbloom (i1, i2, i3, i4, i5, i6);
91 =# SELECT pg_size_pretty(pg_relation_size('btreeidx'));
96 =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
98 -------------------------------------------------------------------------------
99 -----------------------
100 Seq Scan on tbloom (cost=0.00..213744.00 rows=2 width=24) (actual time=351.016
101 ..351.017 rows=0.00 loops=1)
102 Filter: ((i2 = 898732) AND (i5 = 123451))
103 Rows Removed by Filter: 10000000
104 Buffers: shared hit=63744
105 Planning Time: 0.138 ms
106 Execution Time: 351.035 ms
109 Having the bloom index defined on the table is better than btree in
110 handling this type of search:
111 =# CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3, i4, i5, i6);
113 =# SELECT pg_size_pretty(pg_relation_size('bloomidx'));
118 =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
120 -------------------------------------------------------------------------------
121 --------------------------------------
122 Bitmap Heap Scan on tbloom (cost=1792.00..1799.69 rows=2 width=24) (actual tim
123 e=22.605..22.606 rows=0.00 loops=1)
124 Recheck Cond: ((i2 = 898732) AND (i5 = 123451))
125 Rows Removed by Index Recheck: 2300
126 Heap Blocks: exact=2256
127 Buffers: shared hit=21864
128 -> Bitmap Index Scan on bloomidx (cost=0.00..178436.00 rows=1 width=0) (act
129 ual time=20.005..20.005 rows=2300.00 loops=1)
130 Index Cond: ((i2 = 898732) AND (i5 = 123451))
132 Buffers: shared hit=19608
133 Planning Time: 0.099 ms
134 Execution Time: 22.632 ms
137 Now, the main problem with the btree search is that btree is
138 inefficient when the search conditions do not constrain the leading
139 index column(s). A better strategy for btree is to create a separate
140 index on each column. Then the planner will choose something like this:
141 =# CREATE INDEX btreeidx1 ON tbloom (i1);
143 =# CREATE INDEX btreeidx2 ON tbloom (i2);
145 =# CREATE INDEX btreeidx3 ON tbloom (i3);
147 =# CREATE INDEX btreeidx4 ON tbloom (i4);
149 =# CREATE INDEX btreeidx5 ON tbloom (i5);
151 =# CREATE INDEX btreeidx6 ON tbloom (i6);
153 =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
155 -------------------------------------------------------------------------------
156 --------------------------------------------
157 Bitmap Heap Scan on tbloom (cost=9.29..13.30 rows=1 width=24) (actual time=0.0
158 32..0.033 rows=0.00 loops=1)
159 Recheck Cond: ((i5 = 123451) AND (i2 = 898732))
160 Buffers: shared read=6
161 -> BitmapAnd (cost=9.29..9.29 rows=1 width=0) (actual time=0.047..0.047 row
163 Buffers: shared hit=6
164 -> Bitmap Index Scan on btreeidx5 (cost=0.00..4.52 rows=11 width=0) (
165 actual time=0.026..0.026 rows=7.00 loops=1)
166 Index Cond: (i5 = 123451)
168 Buffers: shared hit=3
169 -> Bitmap Index Scan on btreeidx2 (cost=0.00..4.52 rows=11 width=0) (
170 actual time=0.007..0.007 rows=8.00 loops=1)
171 Index Cond: (i2 = 898732)
173 Buffers: shared hit=3
174 Planning Time: 0.264 ms
175 Execution Time: 0.047 ms
178 Although this query runs much faster than with either of the single
179 indexes, we pay a penalty in index size. Each of the single-column
180 btree indexes occupies 88.5 MB, so the total space needed is 531 MB,
181 over three times the space used by the bloom index.
183 F.6.3. Operator Class Interface #
185 An operator class for bloom indexes requires only a hash function for
186 the indexed data type and an equality operator for searching. This
187 example shows the operator class definition for the text data type:
188 CREATE OPERATOR CLASS text_ops
189 DEFAULT FOR TYPE text USING bloom AS
190 OPERATOR 1 =(text, text),
191 FUNCTION 1 hashtext(text);
195 * Only operator classes for int4 and text are included with the
197 * Only the = operator is supported for search. But it is possible to
198 add support for arrays with union and intersection operations in
200 * bloom access method doesn't support UNIQUE indexes.
201 * bloom access method doesn't support searching for NULL values.
205 Teodor Sigaev <teodor@postgrespro.ru>, Postgres Professional, Moscow,
208 Alexander Korotkov <a.korotkov@postgrespro.ru>, Postgres Professional,
211 Oleg Bartunov <obartunov@postgrespro.ru>, Postgres Professional,