begriffs open source - ai-pg/blob - full-docs/txt/multivariate-statistics-examples.txt

   1
   2 69.2. Multivariate Statistics Examples #
   3
   4    69.2.1. Functional Dependencies
   5    69.2.2. Multivariate N-Distinct Counts
   6    69.2.3. MCV Lists
   7
   8 69.2.1. Functional Dependencies #
   9
  10    Multivariate correlation can be demonstrated with a very simple data
  11    set — a table with two columns, both containing the same values:
  12 CREATE TABLE t (a INT, b INT);
  13 INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
  14 ANALYZE t;
  15
  16    As explained in Section 14.2, the planner can determine cardinality of
  17    t using the number of pages and rows obtained from pg_class:
  18 SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
  19
  20  relpages | reltuples
  21 ----------+-----------
  22        45 |     10000
  23
  24    The data distribution is very simple; there are only 100 distinct
  25    values in each column, uniformly distributed.
  26
  27    The following example shows the result of estimating a WHERE condition
  28    on the a column:
  29 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1;
  30                                  QUERY PLAN
  31 -------------------------------------------------------------------------------
  32  Seq Scan on t  (cost=0.00..170.00 rows=100 width=8) (actual rows=100.00 loops=1
  33 )
  34    Filter: (a = 1)
  35    Rows Removed by Filter: 9900
  36
  37    The planner examines the condition and determines the selectivity of
  38    this clause to be 1%. By comparing this estimate and the actual number
  39    of rows, we see that the estimate is very accurate (in fact exact, as
  40    the table is very small). Changing the WHERE condition to use the b
  41    column, an identical plan is generated. But observe what happens if we
  42    apply the same condition on both columns, combining them with AND:
  43 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1
  44 ;
  45                                  QUERY PLAN
  46 -----------------------------------------------------------------------------
  47  Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual rows=100.00 loops=1)
  48    Filter: ((a = 1) AND (b = 1))
  49    Rows Removed by Filter: 9900
  50
  51    The planner estimates the selectivity for each condition individually,
  52    arriving at the same 1% estimates as above. Then it assumes that the
  53    conditions are independent, and so it multiplies their selectivities,
  54    producing a final selectivity estimate of just 0.01%. This is a
  55    significant underestimate, as the actual number of rows matching the
  56    conditions (100) is two orders of magnitude higher.
  57
  58    This problem can be fixed by creating a statistics object that directs
  59    ANALYZE to calculate functional-dependency multivariate statistics on
  60    the two columns:
  61 CREATE STATISTICS stts (dependencies) ON a, b FROM t;
  62 ANALYZE t;
  63 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1
  64 ;
  65                                   QUERY PLAN
  66 -------------------------------------------------------------------------------
  67  Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual rows=100.00 loops=1
  68 )
  69    Filter: ((a = 1) AND (b = 1))
  70    Rows Removed by Filter: 9900
  71
  72 69.2.2. Multivariate N-Distinct Counts #
  73
  74    A similar problem occurs with estimation of the cardinality of sets of
  75    multiple columns, such as the number of groups that would be generated
  76    by a GROUP BY clause. When GROUP BY lists a single column, the
  77    n-distinct estimate (which is visible as the estimated number of rows
  78    returned by the HashAggregate node) is very accurate:
  79 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a;
  80                                        QUERY PLAN
  81 -------------------------------------------------------------------------------
  82 ----------
  83  HashAggregate  (cost=195.00..196.00 rows=100 width=12) (actual rows=100.00 loop
  84 s=1)
  85    Group Key: a
  86    ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000.
  87 00 loops=1)
  88
  89    But without multivariate statistics, the estimate for the number of
  90    groups in a query with two columns in GROUP BY, as in the following
  91    example, is off by an order of magnitude:
  92 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
  93                                        QUERY PLAN
  94 -------------------------------------------------------------------------------
  95 -------------
  96  HashAggregate  (cost=220.00..230.00 rows=1000 width=16) (actual rows=100.00 loo
  97 ps=1)
  98    Group Key: a, b
  99    ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000.
 100 00 loops=1)
 101
 102    By redefining the statistics object to include n-distinct counts for
 103    the two columns, the estimate is much improved:
 104 DROP STATISTICS stts;
 105 CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t;
 106 ANALYZE t;
 107 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
 108                                        QUERY PLAN
 109 -------------------------------------------------------------------------------
 110 -------------
 111  HashAggregate  (cost=220.00..221.00 rows=100 width=16) (actual rows=100.00 loop
 112 s=1)
 113    Group Key: a, b
 114    ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000.
 115 00 loops=1)
 116
 117 69.2.3. MCV Lists #
 118
 119    As explained in Section 69.2.1, functional dependencies are very cheap
 120    and efficient type of statistics, but their main limitation is their
 121    global nature (only tracking dependencies at the column level, not
 122    between individual column values).
 123
 124    This section introduces multivariate variant of MCV (most-common
 125    values) lists, a straightforward extension of the per-column statistics
 126    described in Section 69.1. These statistics address the limitation by
 127    storing individual values, but it is naturally more expensive, both in
 128    terms of building the statistics in ANALYZE, storage and planning time.
 129
 130    Let's look at the query from Section 69.2.1 again, but this time with a
 131    MCV list created on the same set of columns (be sure to drop the
 132    functional dependencies, to make sure the planner uses the newly
 133    created statistics).
 134 DROP STATISTICS stts;
 135 CREATE STATISTICS stts2 (mcv) ON a, b FROM t;
 136 ANALYZE t;
 137 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1
 138 ;
 139                                    QUERY PLAN
 140 -------------------------------------------------------------------------------
 141  Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual rows=100.00 loops=1
 142 )
 143    Filter: ((a = 1) AND (b = 1))
 144    Rows Removed by Filter: 9900
 145
 146    The estimate is as accurate as with the functional dependencies, mostly
 147    thanks to the table being fairly small and having a simple distribution
 148    with a low number of distinct values. Before looking at the second
 149    query, which was not handled by functional dependencies particularly
 150    well, let's inspect the MCV list a bit.
 151
 152    Inspecting the MCV list is possible using pg_mcv_list_items
 153    set-returning function.
 154 SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
 155                 pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts2';
 156  index |  values  | nulls | frequency | base_frequency
 157 -------+----------+-------+-----------+----------------
 158      0 | {0, 0}   | {f,f} |      0.01 |         0.0001
 159      1 | {1, 1}   | {f,f} |      0.01 |         0.0001
 160    ...
 161     49 | {49, 49} | {f,f} |      0.01 |         0.0001
 162     50 | {50, 50} | {f,f} |      0.01 |         0.0001
 163    ...
 164     97 | {97, 97} | {f,f} |      0.01 |         0.0001
 165     98 | {98, 98} | {f,f} |      0.01 |         0.0001
 166     99 | {99, 99} | {f,f} |      0.01 |         0.0001
 167 (100 rows)
 168
 169    This confirms there are 100 distinct combinations in the two columns,
 170    and all of them are about equally likely (1% frequency for each one).
 171    The base frequency is the frequency computed from per-column
 172    statistics, as if there were no multi-column statistics. Had there been
 173    any null values in either of the columns, this would be identified in
 174    the nulls column.
 175
 176    When estimating the selectivity, the planner applies all the conditions
 177    on items in the MCV list, and then sums the frequencies of the matching
 178    ones. See mcv_clauselist_selectivity in src/backend/statistics/mcv.c
 179    for details.
 180
 181    Compared to functional dependencies, MCV lists have two major
 182    advantages. Firstly, the list stores actual values, making it possible
 183    to decide which combinations are compatible.
 184 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1
 185 0;
 186                                  QUERY PLAN
 187 ---------------------------------------------------------------------------
 188  Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual rows=0.00 loops=1)
 189    Filter: ((a = 1) AND (b = 10))
 190    Rows Removed by Filter: 10000
 191
 192    Secondly, MCV lists handle a wider range of clause types, not just
 193    equality clauses like functional dependencies. For example, consider
 194    the following range query for the same table:
 195 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a <= 49 AND b >
 196  49;
 197                                 QUERY PLAN
 198 ---------------------------------------------------------------------------
 199  Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual rows=0.00 loops=1)
 200    Filter: ((a <= 49) AND (b > 49))
 201    Rows Removed by Filter: 10000