1 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>14.2. Statistics Used by the Planner</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="using-explain.html" title="14.1. Using EXPLAIN" /><link rel="next" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">14.2. Statistics Used by the Planner</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="using-explain.html" title="14.1. Using EXPLAIN">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="performance-tips.html" title="Chapter 14. Performance Tips">Up</a></td><th width="60%" align="center">Chapter 14. Performance Tips</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses">Next</a></td></tr></table><hr /></div><div class="sect1" id="PLANNER-STATS"><div class="titlepage"><div><div><h2 class="title" style="clear: both">14.2. Statistics Used by the Planner <a href="#PLANNER-STATS" class="id_link">#</a></h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="planner-stats.html#PLANNER-STATS-SINGLE-COLUMN">14.2.1. Single-Column Statistics</a></span></dt><dt><span class="sect2"><a href="planner-stats.html#PLANNER-STATS-EXTENDED">14.2.2. Extended Statistics</a></span></dt></dl></div><a id="id-1.5.13.5.2" class="indexterm"></a><div class="sect2" id="PLANNER-STATS-SINGLE-COLUMN"><div class="titlepage"><div><div><h3 class="title">14.2.1. Single-Column Statistics <a href="#PLANNER-STATS-SINGLE-COLUMN" class="id_link">#</a></h3></div></div></div><p>
3 As we saw in the previous section, the query planner needs to estimate
4 the number of rows retrieved by a query in order to make good choices
5 of query plans. This section provides a quick look at the statistics
6 that the system uses for these estimates.
8 One component of the statistics is the total number of entries in
9 each table and index, as well as the number of disk blocks occupied
10 by each table and index. This information is kept in the table
11 <a class="link" href="catalog-pg-class.html" title="52.11. pg_class"><code class="structname">pg_class</code></a>,
12 in the columns <code class="structfield">reltuples</code> and
13 <code class="structfield">relpages</code>. We can look at it with
14 queries similar to this one:
16 </p><pre class="screen">
17 SELECT relname, relkind, reltuples, relpages
19 WHERE relname LIKE 'tenk1%';
21 relname | relkind | reltuples | relpages
22 ----------------------+---------+-----------+----------
23 tenk1 | r | 10000 | 345
24 tenk1_hundred | i | 10000 | 11
25 tenk1_thous_tenthous | i | 10000 | 30
26 tenk1_unique1 | i | 10000 | 30
27 tenk1_unique2 | i | 10000 | 30
31 Here we can see that <code class="structname">tenk1</code> contains 10000
32 rows, as do its indexes, but the indexes are (unsurprisingly) much
33 smaller than the table.
35 For efficiency reasons, <code class="structfield">reltuples</code>
36 and <code class="structfield">relpages</code> are not updated on-the-fly,
37 and so they usually contain somewhat out-of-date values.
38 They are updated by <code class="command">VACUUM</code>, <code class="command">ANALYZE</code>, and a
39 few DDL commands such as <code class="command">CREATE INDEX</code>. A <code class="command">VACUUM</code>
40 or <code class="command">ANALYZE</code> operation that does not scan the entire table
41 (which is commonly the case) will incrementally update the
42 <code class="structfield">reltuples</code> count on the basis of the part
43 of the table it did scan, resulting in an approximate value.
44 In any case, the planner
45 will scale the values it finds in <code class="structname">pg_class</code>
46 to match the current physical table size, thus obtaining a closer
48 </p><a id="id-1.5.13.5.3.5" class="indexterm"></a><p>
49 Most queries retrieve only a fraction of the rows in a table, due
50 to <code class="literal">WHERE</code> clauses that restrict the rows to be
51 examined. The planner thus needs to make an estimate of the
52 <em class="firstterm">selectivity</em> of <code class="literal">WHERE</code> clauses, that is,
53 the fraction of rows that match each condition in the
54 <code class="literal">WHERE</code> clause. The information used for this task is
56 <a class="link" href="catalog-pg-statistic.html" title="52.51. pg_statistic"><code class="structname">pg_statistic</code></a>
57 system catalog. Entries in <code class="structname">pg_statistic</code>
58 are updated by the <code class="command">ANALYZE</code> and <code class="command">VACUUM
59 ANALYZE</code> commands, and are always approximate even when freshly
61 </p><a id="id-1.5.13.5.3.7" class="indexterm"></a><p>
62 Rather than look at <code class="structname">pg_statistic</code> directly,
63 it's better to look at its view
64 <a class="link" href="view-pg-stats.html" title="53.29. pg_stats"><code class="structname">pg_stats</code></a>
65 when examining the statistics manually. <code class="structname">pg_stats</code>
66 is designed to be more easily readable. Furthermore,
67 <code class="structname">pg_stats</code> is readable by all, whereas
68 <code class="structname">pg_statistic</code> is only readable by a superuser.
69 (This prevents unprivileged users from learning something about
70 the contents of other people's tables from the statistics. The
71 <code class="structname">pg_stats</code> view is restricted to show only
72 rows about tables that the current user can read.)
73 For example, we might do:
75 </p><pre class="screen">
76 SELECT attname, inherited, n_distinct,
77 array_to_string(most_common_vals, E'\n') as most_common_vals
79 WHERE tablename = 'road';
81 attname | inherited | n_distinct | most_common_vals
82 ---------+-----------+------------+------------------------------------
83 name | f | -0.5681108 | I- 580 Ramp+
91 | | | Mac Arthur Blvd+
94 name | t | -0.5125 | I- 580 Ramp+
101 | | | State Hwy 13 Ramp+
103 | | | State Hwy 24 Ramp+
110 Note that two rows are displayed for the same column, one corresponding
111 to the complete inheritance hierarchy starting at the
112 <code class="literal">road</code> table (<code class="literal">inherited</code>=<code class="literal">t</code>),
113 and another one including only the <code class="literal">road</code> table itself
114 (<code class="literal">inherited</code>=<code class="literal">f</code>).
115 (For brevity, we have only shown the first ten most-common values for
116 the <code class="literal">name</code> column.)
118 The amount of information stored in <code class="structname">pg_statistic</code>
119 by <code class="command">ANALYZE</code>, in particular the maximum number of entries in the
120 <code class="structfield">most_common_vals</code> and <code class="structfield">histogram_bounds</code>
121 arrays for each column, can be set on a
122 column-by-column basis using the <code class="command">ALTER TABLE SET STATISTICS</code>
123 command, or globally by setting the
124 <a class="xref" href="runtime-config-query.html#GUC-DEFAULT-STATISTICS-TARGET">default_statistics_target</a> configuration variable.
125 The default limit is presently 100 entries. Raising the limit
126 might allow more accurate planner estimates to be made, particularly for
127 columns with irregular data distributions, at the price of consuming
128 more space in <code class="structname">pg_statistic</code> and slightly more
129 time to compute the estimates. Conversely, a lower limit might be
130 sufficient for columns with simple data distributions.
132 Further details about the planner's use of statistics can be found in
133 <a class="xref" href="planner-stats-details.html" title="Chapter 69. How the Planner Uses Statistics">Chapter 69</a>.
134 </p></div><div class="sect2" id="PLANNER-STATS-EXTENDED"><div class="titlepage"><div><div><h3 class="title">14.2.2. Extended Statistics <a href="#PLANNER-STATS-EXTENDED" class="id_link">#</a></h3></div></div></div><a id="id-1.5.13.5.4.2" class="indexterm"></a><a id="id-1.5.13.5.4.3" class="indexterm"></a><a id="id-1.5.13.5.4.4" class="indexterm"></a><a id="id-1.5.13.5.4.5" class="indexterm"></a><p>
135 It is common to see slow queries running bad execution plans because
136 multiple columns used in the query clauses are correlated.
137 The planner normally assumes that multiple conditions
138 are independent of each other,
139 an assumption that does not hold when column values are correlated.
140 Regular statistics, because of their per-individual-column nature,
141 cannot capture any knowledge about cross-column correlation.
142 However, <span class="productname">PostgreSQL</span> has the ability to compute
143 <em class="firstterm">multivariate statistics</em>, which can capture
146 Because the number of possible column combinations is very large,
147 it's impractical to compute multivariate statistics automatically.
148 Instead, <em class="firstterm">extended statistics objects</em>, more often
149 called just <em class="firstterm">statistics objects</em>, can be created to instruct
150 the server to obtain statistics across interesting sets of columns.
152 Statistics objects are created using the
153 <a class="link" href="sql-createstatistics.html" title="CREATE STATISTICS"><code class="command">CREATE STATISTICS</code></a> command.
154 Creation of such an object merely creates a catalog entry expressing
155 interest in the statistics. Actual data collection is performed
156 by <code class="command">ANALYZE</code> (either a manual command, or background
157 auto-analyze). The collected values can be examined in the
158 <a class="link" href="catalog-pg-statistic-ext-data.html" title="52.53. pg_statistic_ext_data"><code class="structname">pg_statistic_ext_data</code></a>
161 <code class="command">ANALYZE</code> computes extended statistics based on the same
162 sample of table rows that it takes for computing regular single-column
163 statistics. Since the sample size is increased by increasing the
164 statistics target for the table or any of its columns (as described in
165 the previous section), a larger statistics target will normally result in
166 more accurate extended statistics, as well as more time spent calculating
169 The following subsections describe the kinds of extended statistics
170 that are currently supported.
171 </p><div class="sect3" id="PLANNER-STATS-EXTENDED-FUNCTIONAL-DEPS"><div class="titlepage"><div><div><h4 class="title">14.2.2.1. Functional Dependencies <a href="#PLANNER-STATS-EXTENDED-FUNCTIONAL-DEPS" class="id_link">#</a></h4></div></div></div><p>
172 The simplest kind of extended statistics tracks <em class="firstterm">functional
173 dependencies</em>, a concept used in definitions of database normal forms.
174 We say that column <code class="structfield">b</code> is functionally dependent on
175 column <code class="structfield">a</code> if knowledge of the value of
176 <code class="structfield">a</code> is sufficient to determine the value
177 of <code class="structfield">b</code>, that is there are no two rows having the same value
178 of <code class="structfield">a</code> but different values of <code class="structfield">b</code>.
179 In a fully normalized database, functional dependencies should exist
180 only on primary keys and superkeys. However, in practice many data sets
181 are not fully normalized for various reasons; intentional
182 denormalization for performance reasons is a common example.
183 Even in a fully normalized database, there may be partial correlation
184 between some columns, which can be expressed as partial functional
187 The existence of functional dependencies directly affects the accuracy
188 of estimates in certain queries. If a query contains conditions on
189 both the independent and the dependent column(s), the
190 conditions on the dependent columns do not further reduce the result
191 size; but without knowledge of the functional dependency, the query
192 planner will assume that the conditions are independent, resulting
193 in underestimating the result size.
195 To inform the planner about functional dependencies, <code class="command">ANALYZE</code>
196 can collect measurements of cross-column dependency. Assessing the
197 degree of dependency between all sets of columns would be prohibitively
198 expensive, so data collection is limited to those groups of columns
199 appearing together in a statistics object defined with
200 the <code class="literal">dependencies</code> option. It is advisable to create
201 <code class="literal">dependencies</code> statistics only for column groups that are
202 strongly correlated, to avoid unnecessary overhead in both
203 <code class="command">ANALYZE</code> and later query planning.
205 Here is an example of collecting functional-dependency statistics:
206 </p><pre class="programlisting">
207 CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes;
211 SELECT stxname, stxkeys, stxddependencies
212 FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
213 WHERE stxname = 'stts';
214 stxname | stxkeys | stxddependencies
215 ---------+---------+------------------------------------------
216 stts | 1 5 | {"1 => 5": 1.000000, "5 => 1": 0.423130}
219 Here it can be seen that column 1 (zip code) fully determines column
220 5 (city) so the coefficient is 1.0, while city only determines zip code
221 about 42% of the time, meaning that there are many cities (58%) that are
222 represented by more than a single ZIP code.
224 When computing the selectivity for a query involving functionally
225 dependent columns, the planner adjusts the per-condition selectivity
226 estimates using the dependency coefficients so as not to produce
228 </p><div class="sect4" id="PLANNER-STATS-EXTENDED-FUNCTIONAL-DEPS-LIMITS"><div class="titlepage"><div><div><h5 class="title">14.2.2.1.1. Limitations of Functional Dependencies <a href="#PLANNER-STATS-EXTENDED-FUNCTIONAL-DEPS-LIMITS" class="id_link">#</a></h5></div></div></div><p>
229 Functional dependencies are currently only applied when considering
230 simple equality conditions that compare columns to constant values,
231 and <code class="literal">IN</code> clauses with constant values.
232 They are not used to improve estimates for equality conditions
233 comparing two columns or comparing a column to an expression, nor for
234 range clauses, <code class="literal">LIKE</code> or any other type of condition.
236 When estimating with functional dependencies, the planner assumes that
237 conditions on the involved columns are compatible and hence redundant.
238 If they are incompatible, the correct estimate would be zero rows, but
239 that possibility is not considered. For example, given a query like
240 </p><pre class="programlisting">
241 SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105';
243 the planner will disregard the <code class="structfield">city</code> clause as not
244 changing the selectivity, which is correct. However, it will make
245 the same assumption about
246 </p><pre class="programlisting">
247 SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210';
249 even though there will really be zero rows satisfying this query.
250 Functional dependency statistics do not provide enough information
251 to conclude that, however.
253 In many practical situations, this assumption is usually satisfied;
254 for example, there might be a GUI in the application that only allows
255 selecting compatible city and ZIP code values to use in a query.
256 But if that's not the case, functional dependencies may not be a viable
258 </p></div></div><div class="sect3" id="PLANNER-STATS-EXTENDED-N-DISTINCT-COUNTS"><div class="titlepage"><div><div><h4 class="title">14.2.2.2. Multivariate N-Distinct Counts <a href="#PLANNER-STATS-EXTENDED-N-DISTINCT-COUNTS" class="id_link">#</a></h4></div></div></div><p>
259 Single-column statistics store the number of distinct values in each
260 column. Estimates of the number of distinct values when combining more
261 than one column (for example, for <code class="literal">GROUP BY a, b</code>) are
262 frequently wrong when the planner only has single-column statistical
263 data, causing it to select bad plans.
265 To improve such estimates, <code class="command">ANALYZE</code> can collect n-distinct
266 statistics for groups of columns. As before, it's impractical to do
267 this for every possible column grouping, so data is collected only for
268 those groups of columns appearing together in a statistics object
269 defined with the <code class="literal">ndistinct</code> option. Data will be collected
270 for each possible combination of two or more columns from the set of
273 Continuing the previous example, the n-distinct counts in a
274 table of ZIP codes might look like the following:
275 </p><pre class="programlisting">
276 CREATE STATISTICS stts2 (ndistinct) ON city, state, zip FROM zipcodes;
280 SELECT stxkeys AS k, stxdndistinct AS nd
281 FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
282 WHERE stxname = 'stts2';
283 -[ RECORD 1 ]--------------------------------------------------------
285 nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
288 This indicates that there are three combinations of columns that
289 have 33178 distinct values: ZIP code and state; ZIP code and city;
290 and ZIP code, city and state (the fact that they are all equal is
291 expected given that ZIP code alone is unique in this table). On the
292 other hand, the combination of city and state has only 27435 distinct
295 It's advisable to create <code class="literal">ndistinct</code> statistics objects only
296 on combinations of columns that are actually used for grouping, and
297 for which misestimation of the number of groups is resulting in bad
298 plans. Otherwise, the <code class="command">ANALYZE</code> cycles are just wasted.
299 </p></div><div class="sect3" id="PLANNER-STATS-EXTENDED-MCV-LISTS"><div class="titlepage"><div><div><h4 class="title">14.2.2.3. Multivariate MCV Lists <a href="#PLANNER-STATS-EXTENDED-MCV-LISTS" class="id_link">#</a></h4></div></div></div><p>
300 Another type of statistic stored for each column are most-common value
301 lists. This allows very accurate estimates for individual columns, but
302 may result in significant misestimates for queries with conditions on
305 To improve such estimates, <code class="command">ANALYZE</code> can collect MCV
306 lists on combinations of columns. Similarly to functional dependencies
307 and n-distinct coefficients, it's impractical to do this for every
308 possible column grouping. Even more so in this case, as the MCV list
309 (unlike functional dependencies and n-distinct coefficients) does store
310 the common column values. So data is collected only for those groups
311 of columns appearing together in a statistics object defined with the
312 <code class="literal">mcv</code> option.
314 Continuing the previous example, the MCV list for a table of ZIP codes
315 might look like the following (unlike for simpler types of statistics,
316 a function is required for inspection of MCV contents):
318 </p><pre class="programlisting">
319 CREATE STATISTICS stts3 (mcv) ON city, state FROM zipcodes;
323 SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
324 pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts3';
326 index | values | nulls | frequency | base_frequency
327 -------+------------------------+-------+-----------+----------------
328 0 | {Washington, DC} | {f,f} | 0.003467 | 2.7e-05
329 1 | {Apo, AE} | {f,f} | 0.003067 | 1.9e-05
330 2 | {Houston, TX} | {f,f} | 0.002167 | 0.000133
331 3 | {El Paso, TX} | {f,f} | 0.002 | 0.000113
332 4 | {New York, NY} | {f,f} | 0.001967 | 0.000114
333 5 | {Atlanta, GA} | {f,f} | 0.001633 | 3.3e-05
334 6 | {Sacramento, CA} | {f,f} | 0.001433 | 7.8e-05
335 7 | {Miami, FL} | {f,f} | 0.0014 | 6e-05
336 8 | {Dallas, TX} | {f,f} | 0.001367 | 8.8e-05
337 9 | {Chicago, IL} | {f,f} | 0.001333 | 5.1e-05
341 This indicates that the most common combination of city and state is
342 Washington in DC, with actual frequency (in the sample) about 0.35%.
343 The base frequency of the combination (as computed from the simple
344 per-column frequencies) is only 0.0027%, resulting in two orders of
345 magnitude under-estimates.
347 It's advisable to create <acronym class="acronym">MCV</acronym> statistics objects only
348 on combinations of columns that are actually used in conditions together,
349 and for which misestimation of the number of groups is resulting in bad
350 plans. Otherwise, the <code class="command">ANALYZE</code> and planning cycles
352 </p></div></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="using-explain.html" title="14.1. Using EXPLAIN">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="performance-tips.html" title="Chapter 14. Performance Tips">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses">Next</a></td></tr><tr><td width="40%" align="left" valign="top">14.1. Using <code class="command">EXPLAIN</code> </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 14.3. Controlling the Planner with Explicit <code class="literal">JOIN</code> Clauses</td></tr></table></div></body></html>