1 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>69.2. Multivariate Statistics Examples</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="row-estimation-examples.html" title="69.1. Row Estimation Examples" /><link rel="next" href="planner-stats-security.html" title="69.3. Planner Statistics and Security" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">69.2. Multivariate Statistics Examples</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="row-estimation-examples.html" title="69.1. Row Estimation Examples">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="planner-stats-details.html" title="Chapter 69. How the Planner Uses Statistics">Up</a></td><th width="60%" align="center">Chapter 69. How the Planner Uses Statistics</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="planner-stats-security.html" title="69.3. Planner Statistics and Security">Next</a></td></tr></table><hr /></div><div class="sect1" id="MULTIVARIATE-STATISTICS-EXAMPLES"><div class="titlepage"><div><div><h2 class="title" style="clear: both">69.2. Multivariate Statistics Examples <a href="#MULTIVARIATE-STATISTICS-EXAMPLES" class="id_link">#</a></h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="multivariate-statistics-examples.html#FUNCTIONAL-DEPENDENCIES">69.2.1. Functional Dependencies</a></span></dt><dt><span class="sect2"><a href="multivariate-statistics-examples.html#MULTIVARIATE-NDISTINCT-COUNTS">69.2.2. Multivariate N-Distinct Counts</a></span></dt><dt><span class="sect2"><a href="multivariate-statistics-examples.html#MCV-LISTS">69.2.3. MCV Lists</a></span></dt></dl></div><a id="id-1.10.21.5.2" class="indexterm"></a><div class="sect2" id="FUNCTIONAL-DEPENDENCIES"><div class="titlepage"><div><div><h3 class="title">69.2.1. Functional Dependencies <a href="#FUNCTIONAL-DEPENDENCIES" class="id_link">#</a></h3></div></div></div><p>
3 Multivariate correlation can be demonstrated with a very simple data set
4 — a table with two columns, both containing the same values:
6 </p><pre class="programlisting">
7 CREATE TABLE t (a INT, b INT);
8 INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
12 As explained in <a class="xref" href="planner-stats.html" title="14.2. Statistics Used by the Planner">Section 14.2</a>, the planner can determine
13 cardinality of <code class="structname">t</code> using the number of pages and
14 rows obtained from <code class="structname">pg_class</code>:
16 </p><pre class="programlisting">
17 SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
20 ----------+-----------
24 The data distribution is very simple; there are only 100 distinct values
25 in each column, uniformly distributed.
27 The following example shows the result of estimating a <code class="literal">WHERE</code>
28 condition on the <code class="structfield">a</code> column:
30 </p><pre class="programlisting">
31 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1;
33 -------------------------------------------------------------------------------
34 Seq Scan on t (cost=0.00..170.00 rows=100 width=8) (actual rows=100.00 loops=1)
36 Rows Removed by Filter: 9900
39 The planner examines the condition and determines the selectivity
40 of this clause to be 1%. By comparing this estimate and the actual
41 number of rows, we see that the estimate is very accurate
42 (in fact exact, as the table is very small). Changing the
43 <code class="literal">WHERE</code> condition to use the <code class="structfield">b</code> column, an
44 identical plan is generated. But observe what happens if we apply the same
45 condition on both columns, combining them with <code class="literal">AND</code>:
47 </p><pre class="programlisting">
48 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
50 -----------------------------------------------------------------------------
51 Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=100.00 loops=1)
52 Filter: ((a = 1) AND (b = 1))
53 Rows Removed by Filter: 9900
56 The planner estimates the selectivity for each condition individually,
57 arriving at the same 1% estimates as above. Then it assumes that the
58 conditions are independent, and so it multiplies their selectivities,
59 producing a final selectivity estimate of just 0.01%.
60 This is a significant underestimate, as the actual number of rows
61 matching the conditions (100) is two orders of magnitude higher.
63 This problem can be fixed by creating a statistics object that
64 directs <code class="command">ANALYZE</code> to calculate functional-dependency
65 multivariate statistics on the two columns:
67 </p><pre class="programlisting">
68 CREATE STATISTICS stts (dependencies) ON a, b FROM t;
70 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
72 -------------------------------------------------------------------------------
73 Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100.00 loops=1)
74 Filter: ((a = 1) AND (b = 1))
75 Rows Removed by Filter: 9900
77 </p></div><div class="sect2" id="MULTIVARIATE-NDISTINCT-COUNTS"><div class="titlepage"><div><div><h3 class="title">69.2.2. Multivariate N-Distinct Counts <a href="#MULTIVARIATE-NDISTINCT-COUNTS" class="id_link">#</a></h3></div></div></div><p>
78 A similar problem occurs with estimation of the cardinality of sets of
79 multiple columns, such as the number of groups that would be generated by
80 a <code class="command">GROUP BY</code> clause. When <code class="command">GROUP BY</code>
81 lists a single column, the n-distinct estimate (which is visible as the
82 estimated number of rows returned by the HashAggregate node) is very
84 </p><pre class="programlisting">
85 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a;
87 -----------------------------------------------------------------------------------------
88 HashAggregate (cost=195.00..196.00 rows=100 width=12) (actual rows=100.00 loops=1)
90 -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000.00 loops=1)
92 But without multivariate statistics, the estimate for the number of
93 groups in a query with two columns in <code class="command">GROUP BY</code>, as
94 in the following example, is off by an order of magnitude:
95 </p><pre class="programlisting">
96 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
98 --------------------------------------------------------------------------------------------
99 HashAggregate (cost=220.00..230.00 rows=1000 width=16) (actual rows=100.00 loops=1)
101 -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000.00 loops=1)
103 By redefining the statistics object to include n-distinct counts for the
104 two columns, the estimate is much improved:
105 </p><pre class="programlisting">
106 DROP STATISTICS stts;
107 CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t;
109 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
111 --------------------------------------------------------------------------------------------
112 HashAggregate (cost=220.00..221.00 rows=100 width=16) (actual rows=100.00 loops=1)
114 -> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000.00 loops=1)
116 </p></div><div class="sect2" id="MCV-LISTS"><div class="titlepage"><div><div><h3 class="title">69.2.3. MCV Lists <a href="#MCV-LISTS" class="id_link">#</a></h3></div></div></div><p>
117 As explained in <a class="xref" href="multivariate-statistics-examples.html#FUNCTIONAL-DEPENDENCIES" title="69.2.1. Functional Dependencies">Section 69.2.1</a>, functional
118 dependencies are very cheap and efficient type of statistics, but their
119 main limitation is their global nature (only tracking dependencies at
120 the column level, not between individual column values).
122 This section introduces multivariate variant of <acronym class="acronym">MCV</acronym>
123 (most-common values) lists, a straightforward extension of the per-column
124 statistics described in <a class="xref" href="row-estimation-examples.html" title="69.1. Row Estimation Examples">Section 69.1</a>. These
125 statistics address the limitation by storing individual values, but it is
126 naturally more expensive, both in terms of building the statistics in
127 <code class="command">ANALYZE</code>, storage and planning time.
129 Let's look at the query from <a class="xref" href="multivariate-statistics-examples.html#FUNCTIONAL-DEPENDENCIES" title="69.2.1. Functional Dependencies">Section 69.2.1</a>
130 again, but this time with a <acronym class="acronym">MCV</acronym> list created on the
131 same set of columns (be sure to drop the functional dependencies, to
132 make sure the planner uses the newly created statistics).
134 </p><pre class="programlisting">
135 DROP STATISTICS stts;
136 CREATE STATISTICS stts2 (mcv) ON a, b FROM t;
138 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
140 -------------------------------------------------------------------------------
141 Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100.00 loops=1)
142 Filter: ((a = 1) AND (b = 1))
143 Rows Removed by Filter: 9900
146 The estimate is as accurate as with the functional dependencies, mostly
147 thanks to the table being fairly small and having a simple distribution
148 with a low number of distinct values. Before looking at the second query,
149 which was not handled by functional dependencies particularly well,
150 let's inspect the <acronym class="acronym">MCV</acronym> list a bit.
152 Inspecting the <acronym class="acronym">MCV</acronym> list is possible using
153 <code class="function">pg_mcv_list_items</code> set-returning function.
155 </p><pre class="programlisting">
156 SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
157 pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts2';
158 index | values | nulls | frequency | base_frequency
159 -------+----------+-------+-----------+----------------
160 0 | {0, 0} | {f,f} | 0.01 | 0.0001
161 1 | {1, 1} | {f,f} | 0.01 | 0.0001
163 49 | {49, 49} | {f,f} | 0.01 | 0.0001
164 50 | {50, 50} | {f,f} | 0.01 | 0.0001
166 97 | {97, 97} | {f,f} | 0.01 | 0.0001
167 98 | {98, 98} | {f,f} | 0.01 | 0.0001
168 99 | {99, 99} | {f,f} | 0.01 | 0.0001
172 This confirms there are 100 distinct combinations in the two columns, and
173 all of them are about equally likely (1% frequency for each one). The
174 base frequency is the frequency computed from per-column statistics, as if
175 there were no multi-column statistics. Had there been any null values in
176 either of the columns, this would be identified in the
177 <code class="structfield">nulls</code> column.
179 When estimating the selectivity, the planner applies all the conditions
180 on items in the <acronym class="acronym">MCV</acronym> list, and then sums the frequencies
181 of the matching ones. See <code class="function">mcv_clauselist_selectivity</code>
182 in <code class="filename">src/backend/statistics/mcv.c</code> for details.
184 Compared to functional dependencies, <acronym class="acronym">MCV</acronym> lists have two
185 major advantages. Firstly, the list stores actual values, making it possible
186 to decide which combinations are compatible.
188 </p><pre class="programlisting">
189 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 10;
191 ---------------------------------------------------------------------------
192 Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=0.00 loops=1)
193 Filter: ((a = 1) AND (b = 10))
194 Rows Removed by Filter: 10000
197 Secondly, <acronym class="acronym">MCV</acronym> lists handle a wider range of clause types,
198 not just equality clauses like functional dependencies. For example,
199 consider the following range query for the same table:
201 </p><pre class="programlisting">
202 EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a <= 49 AND b > 49;
204 ---------------------------------------------------------------------------
205 Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=0.00 loops=1)
206 Filter: ((a <= 49) AND (b > 49))
207 Rows Removed by Filter: 10000
210 </p></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="row-estimation-examples.html" title="69.1. Row Estimation Examples">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="planner-stats-details.html" title="Chapter 69. How the Planner Uses Statistics">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="planner-stats-security.html" title="69.3. Planner Statistics and Security">Next</a></td></tr><tr><td width="40%" align="left" valign="top">69.1. Row Estimation Examples </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 69.3. Planner Statistics and Security</td></tr></table></div></body></html>