2 69.2. Multivariate Statistics Examples #
4 69.2.1. Functional Dependencies
5 69.2.2. Multivariate N-Distinct Counts
8 69.2.1. Functional Dependencies #
10 Multivariate correlation can be demonstrated with a very simple data
11 set — a table with two columns, both containing the same values:
12 CREATE TABLE t (a INT, b INT);
13 INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
16 As explained in Section 14.2, the planner can determine cardinality of
17 t using the number of pages and rows obtained from pg_class:
18 SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
21 ----------+-----------
24 The data distribution is very simple; there are only 100 distinct
25 values in each column, uniformly distributed.
27 The following example shows the result of estimating a WHERE condition
29 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1;
31 -------------------------------------------------------------------------------
32 Seq Scan on t (cost=0.00..170.00 rows=100 width=8) (actual rows=100.00 loops=1
35 Rows Removed by Filter: 9900
37 The planner examines the condition and determines the selectivity of
38 this clause to be 1%. By comparing this estimate and the actual number
39 of rows, we see that the estimate is very accurate (in fact exact, as
40 the table is very small). Changing the WHERE condition to use the b
41 column, an identical plan is generated. But observe what happens if we
42 apply the same condition on both columns, combining them with AND:
43 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1
46 -----------------------------------------------------------------------------
47 Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=100.00 loops=1)
48 Filter: ((a = 1) AND (b = 1))
49 Rows Removed by Filter: 9900
51 The planner estimates the selectivity for each condition individually,
52 arriving at the same 1% estimates as above. Then it assumes that the
53 conditions are independent, and so it multiplies their selectivities,
54 producing a final selectivity estimate of just 0.01%. This is a
55 significant underestimate, as the actual number of rows matching the
56 conditions (100) is two orders of magnitude higher.
58 This problem can be fixed by creating a statistics object that directs
59 ANALYZE to calculate functional-dependency multivariate statistics on
61 CREATE STATISTICS stts (dependencies) ON a, b FROM t;
63 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1
66 -------------------------------------------------------------------------------
67 Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100.00 loops=1
69 Filter: ((a = 1) AND (b = 1))
70 Rows Removed by Filter: 9900
72 69.2.2. Multivariate N-Distinct Counts #
74 A similar problem occurs with estimation of the cardinality of sets of
75 multiple columns, such as the number of groups that would be generated
76 by a GROUP BY clause. When GROUP BY lists a single column, the
77 n-distinct estimate (which is visible as the estimated number of rows
78 returned by the HashAggregate node) is very accurate:
79 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a;
81 -------------------------------------------------------------------------------
83 HashAggregate (cost=195.00..196.00 rows=100 width=12) (actual rows=100.00 loop
86 -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000.
89 But without multivariate statistics, the estimate for the number of
90 groups in a query with two columns in GROUP BY, as in the following
91 example, is off by an order of magnitude:
92 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
94 -------------------------------------------------------------------------------
96 HashAggregate (cost=220.00..230.00 rows=1000 width=16) (actual rows=100.00 loo
99 -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000.
102 By redefining the statistics object to include n-distinct counts for
103 the two columns, the estimate is much improved:
104 DROP STATISTICS stts;
105 CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t;
107 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
109 -------------------------------------------------------------------------------
111 HashAggregate (cost=220.00..221.00 rows=100 width=16) (actual rows=100.00 loop
114 -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000.
119 As explained in Section 69.2.1, functional dependencies are very cheap
120 and efficient type of statistics, but their main limitation is their
121 global nature (only tracking dependencies at the column level, not
122 between individual column values).
124 This section introduces multivariate variant of MCV (most-common
125 values) lists, a straightforward extension of the per-column statistics
126 described in Section 69.1. These statistics address the limitation by
127 storing individual values, but it is naturally more expensive, both in
128 terms of building the statistics in ANALYZE, storage and planning time.
130 Let's look at the query from Section 69.2.1 again, but this time with a
131 MCV list created on the same set of columns (be sure to drop the
132 functional dependencies, to make sure the planner uses the newly
134 DROP STATISTICS stts;
135 CREATE STATISTICS stts2 (mcv) ON a, b FROM t;
137 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1
140 -------------------------------------------------------------------------------
141 Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100.00 loops=1
143 Filter: ((a = 1) AND (b = 1))
144 Rows Removed by Filter: 9900
146 The estimate is as accurate as with the functional dependencies, mostly
147 thanks to the table being fairly small and having a simple distribution
148 with a low number of distinct values. Before looking at the second
149 query, which was not handled by functional dependencies particularly
150 well, let's inspect the MCV list a bit.
152 Inspecting the MCV list is possible using pg_mcv_list_items
153 set-returning function.
154 SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
155 pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts2';
156 index | values | nulls | frequency | base_frequency
157 -------+----------+-------+-----------+----------------
158 0 | {0, 0} | {f,f} | 0.01 | 0.0001
159 1 | {1, 1} | {f,f} | 0.01 | 0.0001
161 49 | {49, 49} | {f,f} | 0.01 | 0.0001
162 50 | {50, 50} | {f,f} | 0.01 | 0.0001
164 97 | {97, 97} | {f,f} | 0.01 | 0.0001
165 98 | {98, 98} | {f,f} | 0.01 | 0.0001
166 99 | {99, 99} | {f,f} | 0.01 | 0.0001
169 This confirms there are 100 distinct combinations in the two columns,
170 and all of them are about equally likely (1% frequency for each one).
171 The base frequency is the frequency computed from per-column
172 statistics, as if there were no multi-column statistics. Had there been
173 any null values in either of the columns, this would be identified in
176 When estimating the selectivity, the planner applies all the conditions
177 on items in the MCV list, and then sums the frequencies of the matching
178 ones. See mcv_clauselist_selectivity in src/backend/statistics/mcv.c
181 Compared to functional dependencies, MCV lists have two major
182 advantages. Firstly, the list stores actual values, making it possible
183 to decide which combinations are compatible.
184 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1
187 ---------------------------------------------------------------------------
188 Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=0.00 loops=1)
189 Filter: ((a = 1) AND (b = 10))
190 Rows Removed by Filter: 10000
192 Secondly, MCV lists handle a wider range of clause types, not just
193 equality clauses like functional dependencies. For example, consider
194 the following range query for the same table:
195 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a <= 49 AND b >
198 ---------------------------------------------------------------------------
199 Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=0.00 loops=1)
200 Filter: ((a <= 49) AND (b > 49))
201 Rows Removed by Filter: 10000