begriffs open source - ai-pg/blob - full-docs/src/sgml/html/multivariate-statistics-examples.html

   1 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
   2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>69.2. Multivariate Statistics Examples</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="row-estimation-examples.html" title="69.1. Row Estimation Examples" /><link rel="next" href="planner-stats-security.html" title="69.3. Planner Statistics and Security" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">69.2. Multivariate Statistics Examples</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="row-estimation-examples.html" title="69.1. Row Estimation Examples">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="planner-stats-details.html" title="Chapter 69. How the Planner Uses Statistics">Up</a></td><th width="60%" align="center">Chapter 69. How the Planner Uses Statistics</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="planner-stats-security.html" title="69.3. Planner Statistics and Security">Next</a></td></tr></table><hr /></div><div class="sect1" id="MULTIVARIATE-STATISTICS-EXAMPLES"><div class="titlepage"><div><div><h2 class="title" style="clear: both">69.2. Multivariate Statistics Examples <a href="#MULTIVARIATE-STATISTICS-EXAMPLES" class="id_link">#</a></h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="multivariate-statistics-examples.html#FUNCTIONAL-DEPENDENCIES">69.2.1. Functional Dependencies</a></span></dt><dt><span class="sect2"><a href="multivariate-statistics-examples.html#MULTIVARIATE-NDISTINCT-COUNTS">69.2.2. Multivariate N-Distinct Counts</a></span></dt><dt><span class="sect2"><a href="multivariate-statistics-examples.html#MCV-LISTS">69.2.3. MCV Lists</a></span></dt></dl></div><a id="id-1.10.21.5.2" class="indexterm"></a><div class="sect2" id="FUNCTIONAL-DEPENDENCIES"><div class="titlepage"><div><div><h3 class="title">69.2.1. Functional Dependencies <a href="#FUNCTIONAL-DEPENDENCIES" class="id_link">#</a></h3></div></div></div><p>
   3     Multivariate correlation can be demonstrated with a very simple data set
   4     — a table with two columns, both containing the same values:
   5
   6 </p><pre class="programlisting">
   7 CREATE TABLE t (a INT, b INT);
   8 INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
   9 ANALYZE t;
  10 </pre><p>
  11
  12     As explained in <a class="xref" href="planner-stats.html" title="14.2. Statistics Used by the Planner">Section 14.2</a>, the planner can determine
  13     cardinality of <code class="structname">t</code> using the number of pages and
  14     rows obtained from <code class="structname">pg_class</code>:
  15
  16 </p><pre class="programlisting">
  17 SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
  18
  19  relpages | reltuples
  20 ----------+-----------
  21        45 |     10000
  22 </pre><p>
  23
  24     The data distribution is very simple; there are only 100 distinct values
  25     in each column, uniformly distributed.
  26    </p><p>
  27     The following example shows the result of estimating a <code class="literal">WHERE</code>
  28     condition on the <code class="structfield">a</code> column:
  29
  30 </p><pre class="programlisting">
  31 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1;
  32                                  QUERY PLAN
  33 -------------------------------------------------------------------------------
  34  Seq Scan on t  (cost=0.00..170.00 rows=100 width=8) (actual rows=100.00 loops=1)
  35    Filter: (a = 1)
  36    Rows Removed by Filter: 9900
  37 </pre><p>
  38
  39     The planner examines the condition and determines the selectivity
  40     of this clause to be 1%.  By comparing this estimate and the actual
  41     number of rows, we see that the estimate is very accurate
  42     (in fact exact, as the table is very small).  Changing the
  43     <code class="literal">WHERE</code> condition to use the <code class="structfield">b</code> column, an
  44     identical plan is generated.  But observe what happens if we apply the same
  45     condition on both columns, combining them with <code class="literal">AND</code>:
  46
  47 </p><pre class="programlisting">
  48 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
  49                                  QUERY PLAN
  50 -----------------------------------------------------------------------------
  51  Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual rows=100.00 loops=1)
  52    Filter: ((a = 1) AND (b = 1))
  53    Rows Removed by Filter: 9900
  54 </pre><p>
  55
  56     The planner estimates the selectivity for each condition individually,
  57     arriving at the same 1% estimates as above.  Then it assumes that the
  58     conditions are independent, and so it multiplies their selectivities,
  59     producing a final selectivity estimate of just 0.01%.
  60     This is a significant underestimate, as the actual number of rows
  61     matching the conditions (100) is two orders of magnitude higher.
  62    </p><p>
  63     This problem can be fixed by creating a statistics object that
  64     directs <code class="command">ANALYZE</code> to calculate functional-dependency
  65     multivariate statistics on the two columns:
  66
  67 </p><pre class="programlisting">
  68 CREATE STATISTICS stts (dependencies) ON a, b FROM t;
  69 ANALYZE t;
  70 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
  71                                   QUERY PLAN
  72 -------------------------------------------------------------------------------
  73  Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual rows=100.00 loops=1)
  74    Filter: ((a = 1) AND (b = 1))
  75    Rows Removed by Filter: 9900
  76 </pre><p>
  77    </p></div><div class="sect2" id="MULTIVARIATE-NDISTINCT-COUNTS"><div class="titlepage"><div><div><h3 class="title">69.2.2. Multivariate N-Distinct Counts <a href="#MULTIVARIATE-NDISTINCT-COUNTS" class="id_link">#</a></h3></div></div></div><p>
  78     A similar problem occurs with estimation of the cardinality of sets of
  79     multiple columns, such as the number of groups that would be generated by
  80     a <code class="command">GROUP BY</code> clause.  When <code class="command">GROUP BY</code>
  81     lists a single column, the n-distinct estimate (which is visible as the
  82     estimated number of rows returned by the HashAggregate node) is very
  83     accurate:
  84 </p><pre class="programlisting">
  85 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a;
  86                                        QUERY PLAN
  87 -----------------------------------------------------------------------------------------
  88  HashAggregate  (cost=195.00..196.00 rows=100 width=12) (actual rows=100.00 loops=1)
  89    Group Key: a
  90    -&gt;  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000.00 loops=1)
  91 </pre><p>
  92     But without multivariate statistics, the estimate for the number of
  93     groups in a query with two columns in <code class="command">GROUP BY</code>, as
  94     in the following example, is off by an order of magnitude:
  95 </p><pre class="programlisting">
  96 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
  97                                        QUERY PLAN
  98 --------------------------------------------------------------------------------------------
  99  HashAggregate  (cost=220.00..230.00 rows=1000 width=16) (actual rows=100.00 loops=1)
 100    Group Key: a, b
 101    -&gt;  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000.00 loops=1)
 102 </pre><p>
 103     By redefining the statistics object to include n-distinct counts for the
 104     two columns, the estimate is much improved:
 105 </p><pre class="programlisting">
 106 DROP STATISTICS stts;
 107 CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t;
 108 ANALYZE t;
 109 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
 110                                        QUERY PLAN
 111 --------------------------------------------------------------------------------------------
 112  HashAggregate  (cost=220.00..221.00 rows=100 width=16) (actual rows=100.00 loops=1)
 113    Group Key: a, b
 114    -&gt;  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000.00 loops=1)
 115 </pre><p>
 116    </p></div><div class="sect2" id="MCV-LISTS"><div class="titlepage"><div><div><h3 class="title">69.2.3. MCV Lists <a href="#MCV-LISTS" class="id_link">#</a></h3></div></div></div><p>
 117     As explained in <a class="xref" href="multivariate-statistics-examples.html#FUNCTIONAL-DEPENDENCIES" title="69.2.1. Functional Dependencies">Section 69.2.1</a>, functional
 118     dependencies are very cheap and efficient type of statistics, but their
 119     main limitation is their global nature (only tracking dependencies at
 120     the column level, not between individual column values).
 121    </p><p>
 122     This section introduces multivariate variant of <acronym class="acronym">MCV</acronym>
 123     (most-common values) lists, a straightforward extension of the per-column
 124     statistics described in <a class="xref" href="row-estimation-examples.html" title="69.1. Row Estimation Examples">Section 69.1</a>.  These
 125     statistics address the limitation by storing individual values, but it is
 126     naturally more expensive, both in terms of building the statistics in
 127     <code class="command">ANALYZE</code>, storage and planning time.
 128    </p><p>
 129     Let's look at the query from <a class="xref" href="multivariate-statistics-examples.html#FUNCTIONAL-DEPENDENCIES" title="69.2.1. Functional Dependencies">Section 69.2.1</a>
 130     again, but this time with a <acronym class="acronym">MCV</acronym> list created on the
 131     same set of columns (be sure to drop the functional dependencies, to
 132     make sure the planner uses the newly created statistics).
 133
 134 </p><pre class="programlisting">
 135 DROP STATISTICS stts;
 136 CREATE STATISTICS stts2 (mcv) ON a, b FROM t;
 137 ANALYZE t;
 138 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
 139                                    QUERY PLAN
 140 -------------------------------------------------------------------------------
 141  Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual rows=100.00 loops=1)
 142    Filter: ((a = 1) AND (b = 1))
 143    Rows Removed by Filter: 9900
 144 </pre><p>
 145
 146     The estimate is as accurate as with the functional dependencies, mostly
 147     thanks to the table being fairly small and having a simple distribution
 148     with a low number of distinct values. Before looking at the second query,
 149     which was not handled by functional dependencies particularly well,
 150     let's inspect the <acronym class="acronym">MCV</acronym> list a bit.
 151    </p><p>
 152     Inspecting the <acronym class="acronym">MCV</acronym> list is possible using
 153     <code class="function">pg_mcv_list_items</code> set-returning function.
 154
 155 </p><pre class="programlisting">
 156 SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
 157                 pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts2';
 158  index |  values  | nulls | frequency | base_frequency
 159 -------+----------+-------+-----------+----------------
 160      0 | {0, 0}   | {f,f} |      0.01 |         0.0001
 161      1 | {1, 1}   | {f,f} |      0.01 |         0.0001
 162    ...
 163     49 | {49, 49} | {f,f} |      0.01 |         0.0001
 164     50 | {50, 50} | {f,f} |      0.01 |         0.0001
 165    ...
 166     97 | {97, 97} | {f,f} |      0.01 |         0.0001
 167     98 | {98, 98} | {f,f} |      0.01 |         0.0001
 168     99 | {99, 99} | {f,f} |      0.01 |         0.0001
 169 (100 rows)
 170 </pre><p>
 171
 172     This confirms there are 100 distinct combinations in the two columns, and
 173     all of them are about equally likely (1% frequency for each one).  The
 174     base frequency is the frequency computed from per-column statistics, as if
 175     there were no multi-column statistics. Had there been any null values in
 176     either of the columns, this would be identified in the
 177     <code class="structfield">nulls</code> column.
 178    </p><p>
 179     When estimating the selectivity, the planner applies all the conditions
 180     on items in the <acronym class="acronym">MCV</acronym> list, and then sums the frequencies
 181     of the matching ones. See <code class="function">mcv_clauselist_selectivity</code>
 182     in <code class="filename">src/backend/statistics/mcv.c</code> for details.
 183    </p><p>
 184     Compared to functional dependencies, <acronym class="acronym">MCV</acronym> lists have two
 185     major advantages. Firstly, the list stores actual values, making it possible
 186     to decide which combinations are compatible.
 187
 188 </p><pre class="programlisting">
 189 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 10;
 190                                  QUERY PLAN
 191 ---------------------------------------------------------------------------
 192  Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual rows=0.00 loops=1)
 193    Filter: ((a = 1) AND (b = 10))
 194    Rows Removed by Filter: 10000
 195 </pre><p>
 196
 197     Secondly, <acronym class="acronym">MCV</acronym> lists handle a wider range of clause types,
 198     not just equality clauses like functional dependencies. For example,
 199     consider the following range query for the same table:
 200
 201 </p><pre class="programlisting">
 202 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a &lt;= 49 AND b &gt; 49;
 203                                 QUERY PLAN
 204 ---------------------------------------------------------------------------
 205  Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual rows=0.00 loops=1)
 206    Filter: ((a &lt;= 49) AND (b &gt; 49))
 207    Rows Removed by Filter: 10000
 208 </pre><p>
 209
 210    </p></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="row-estimation-examples.html" title="69.1. Row Estimation Examples">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="planner-stats-details.html" title="Chapter 69. How the Planner Uses Statistics">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="planner-stats-security.html" title="69.3. Planner Statistics and Security">Next</a></td></tr><tr><td width="40%" align="left" valign="top">69.1. Row Estimation Examples </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 69.3. Planner Statistics and Security</td></tr></table></div></body></html>