2 14.2. Statistics Used by the Planner #
4 14.2.1. Single-Column Statistics
5 14.2.2. Extended Statistics
7 14.2.1. Single-Column Statistics #
9 As we saw in the previous section, the query planner needs to estimate
10 the number of rows retrieved by a query in order to make good choices
11 of query plans. This section provides a quick look at the statistics
12 that the system uses for these estimates.
14 One component of the statistics is the total number of entries in each
15 table and index, as well as the number of disk blocks occupied by each
16 table and index. This information is kept in the table pg_class, in the
17 columns reltuples and relpages. We can look at it with queries similar
19 SELECT relname, relkind, reltuples, relpages
21 WHERE relname LIKE 'tenk1%';
23 relname | relkind | reltuples | relpages
24 ----------------------+---------+-----------+----------
25 tenk1 | r | 10000 | 345
26 tenk1_hundred | i | 10000 | 11
27 tenk1_thous_tenthous | i | 10000 | 30
28 tenk1_unique1 | i | 10000 | 30
29 tenk1_unique2 | i | 10000 | 30
32 Here we can see that tenk1 contains 10000 rows, as do its indexes, but
33 the indexes are (unsurprisingly) much smaller than the table.
35 For efficiency reasons, reltuples and relpages are not updated
36 on-the-fly, and so they usually contain somewhat out-of-date values.
37 They are updated by VACUUM, ANALYZE, and a few DDL commands such as
38 CREATE INDEX. A VACUUM or ANALYZE operation that does not scan the
39 entire table (which is commonly the case) will incrementally update the
40 reltuples count on the basis of the part of the table it did scan,
41 resulting in an approximate value. In any case, the planner will scale
42 the values it finds in pg_class to match the current physical table
43 size, thus obtaining a closer approximation.
45 Most queries retrieve only a fraction of the rows in a table, due to
46 WHERE clauses that restrict the rows to be examined. The planner thus
47 needs to make an estimate of the selectivity of WHERE clauses, that is,
48 the fraction of rows that match each condition in the WHERE clause. The
49 information used for this task is stored in the pg_statistic system
50 catalog. Entries in pg_statistic are updated by the ANALYZE and VACUUM
51 ANALYZE commands, and are always approximate even when freshly updated.
53 Rather than look at pg_statistic directly, it's better to look at its
54 view pg_stats when examining the statistics manually. pg_stats is
55 designed to be more easily readable. Furthermore, pg_stats is readable
56 by all, whereas pg_statistic is only readable by a superuser. (This
57 prevents unprivileged users from learning something about the contents
58 of other people's tables from the statistics. The pg_stats view is
59 restricted to show only rows about tables that the current user can
60 read.) For example, we might do:
61 SELECT attname, inherited, n_distinct,
62 array_to_string(most_common_vals, E'\n') as most_common_vals
64 WHERE tablename = 'road';
66 attname | inherited | n_distinct | most_common_vals
67 ---------+-----------+------------+------------------------------------
68 name | f | -0.5681108 | I- 580 Ramp+
76 | | | Mac Arthur Blvd+
79 name | t | -0.5125 | I- 580 Ramp+
86 | | | State Hwy 13 Ramp+
88 | | | State Hwy 24 Ramp+
94 Note that two rows are displayed for the same column, one corresponding
95 to the complete inheritance hierarchy starting at the road table
96 (inherited=t), and another one including only the road table itself
97 (inherited=f). (For brevity, we have only shown the first ten
98 most-common values for the name column.)
100 The amount of information stored in pg_statistic by ANALYZE, in
101 particular the maximum number of entries in the most_common_vals and
102 histogram_bounds arrays for each column, can be set on a
103 column-by-column basis using the ALTER TABLE SET STATISTICS command, or
104 globally by setting the default_statistics_target configuration
105 variable. The default limit is presently 100 entries. Raising the limit
106 might allow more accurate planner estimates to be made, particularly
107 for columns with irregular data distributions, at the price of
108 consuming more space in pg_statistic and slightly more time to compute
109 the estimates. Conversely, a lower limit might be sufficient for
110 columns with simple data distributions.
112 Further details about the planner's use of statistics can be found in
115 14.2.2. Extended Statistics #
117 It is common to see slow queries running bad execution plans because
118 multiple columns used in the query clauses are correlated. The planner
119 normally assumes that multiple conditions are independent of each
120 other, an assumption that does not hold when column values are
121 correlated. Regular statistics, because of their per-individual-column
122 nature, cannot capture any knowledge about cross-column correlation.
123 However, PostgreSQL has the ability to compute multivariate statistics,
124 which can capture such information.
126 Because the number of possible column combinations is very large, it's
127 impractical to compute multivariate statistics automatically. Instead,
128 extended statistics objects, more often called just statistics objects,
129 can be created to instruct the server to obtain statistics across
130 interesting sets of columns.
132 Statistics objects are created using the CREATE STATISTICS command.
133 Creation of such an object merely creates a catalog entry expressing
134 interest in the statistics. Actual data collection is performed by
135 ANALYZE (either a manual command, or background auto-analyze). The
136 collected values can be examined in the pg_statistic_ext_data catalog.
138 ANALYZE computes extended statistics based on the same sample of table
139 rows that it takes for computing regular single-column statistics.
140 Since the sample size is increased by increasing the statistics target
141 for the table or any of its columns (as described in the previous
142 section), a larger statistics target will normally result in more
143 accurate extended statistics, as well as more time spent calculating
146 The following subsections describe the kinds of extended statistics
147 that are currently supported.
149 14.2.2.1. Functional Dependencies #
151 The simplest kind of extended statistics tracks functional
152 dependencies, a concept used in definitions of database normal forms.
153 We say that column b is functionally dependent on column a if knowledge
154 of the value of a is sufficient to determine the value of b, that is
155 there are no two rows having the same value of a but different values
156 of b. In a fully normalized database, functional dependencies should
157 exist only on primary keys and superkeys. However, in practice many
158 data sets are not fully normalized for various reasons; intentional
159 denormalization for performance reasons is a common example. Even in a
160 fully normalized database, there may be partial correlation between
161 some columns, which can be expressed as partial functional dependency.
163 The existence of functional dependencies directly affects the accuracy
164 of estimates in certain queries. If a query contains conditions on both
165 the independent and the dependent column(s), the conditions on the
166 dependent columns do not further reduce the result size; but without
167 knowledge of the functional dependency, the query planner will assume
168 that the conditions are independent, resulting in underestimating the
171 To inform the planner about functional dependencies, ANALYZE can
172 collect measurements of cross-column dependency. Assessing the degree
173 of dependency between all sets of columns would be prohibitively
174 expensive, so data collection is limited to those groups of columns
175 appearing together in a statistics object defined with the dependencies
176 option. It is advisable to create dependencies statistics only for
177 column groups that are strongly correlated, to avoid unnecessary
178 overhead in both ANALYZE and later query planning.
180 Here is an example of collecting functional-dependency statistics:
181 CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes;
185 SELECT stxname, stxkeys, stxddependencies
186 FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
187 WHERE stxname = 'stts';
188 stxname | stxkeys | stxddependencies
189 ---------+---------+------------------------------------------
190 stts | 1 5 | {"1 => 5": 1.000000, "5 => 1": 0.423130}
193 Here it can be seen that column 1 (zip code) fully determines column 5
194 (city) so the coefficient is 1.0, while city only determines zip code
195 about 42% of the time, meaning that there are many cities (58%) that
196 are represented by more than a single ZIP code.
198 When computing the selectivity for a query involving functionally
199 dependent columns, the planner adjusts the per-condition selectivity
200 estimates using the dependency coefficients so as not to produce an
203 14.2.2.1.1. Limitations of Functional Dependencies #
205 Functional dependencies are currently only applied when considering
206 simple equality conditions that compare columns to constant values, and
207 IN clauses with constant values. They are not used to improve estimates
208 for equality conditions comparing two columns or comparing a column to
209 an expression, nor for range clauses, LIKE or any other type of
212 When estimating with functional dependencies, the planner assumes that
213 conditions on the involved columns are compatible and hence redundant.
214 If they are incompatible, the correct estimate would be zero rows, but
215 that possibility is not considered. For example, given a query like
216 SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105';
218 the planner will disregard the city clause as not changing the
219 selectivity, which is correct. However, it will make the same
221 SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210';
223 even though there will really be zero rows satisfying this query.
224 Functional dependency statistics do not provide enough information to
225 conclude that, however.
227 In many practical situations, this assumption is usually satisfied; for
228 example, there might be a GUI in the application that only allows
229 selecting compatible city and ZIP code values to use in a query. But if
230 that's not the case, functional dependencies may not be a viable
233 14.2.2.2. Multivariate N-Distinct Counts #
235 Single-column statistics store the number of distinct values in each
236 column. Estimates of the number of distinct values when combining more
237 than one column (for example, for GROUP BY a, b) are frequently wrong
238 when the planner only has single-column statistical data, causing it to
241 To improve such estimates, ANALYZE can collect n-distinct statistics
242 for groups of columns. As before, it's impractical to do this for every
243 possible column grouping, so data is collected only for those groups of
244 columns appearing together in a statistics object defined with the
245 ndistinct option. Data will be collected for each possible combination
246 of two or more columns from the set of listed columns.
248 Continuing the previous example, the n-distinct counts in a table of
249 ZIP codes might look like the following:
250 CREATE STATISTICS stts2 (ndistinct) ON city, state, zip FROM zipcodes;
254 SELECT stxkeys AS k, stxdndistinct AS nd
255 FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
256 WHERE stxname = 'stts2';
257 -[ RECORD 1 ]--------------------------------------------------------
259 nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
262 This indicates that there are three combinations of columns that have
263 33178 distinct values: ZIP code and state; ZIP code and city; and ZIP
264 code, city and state (the fact that they are all equal is expected
265 given that ZIP code alone is unique in this table). On the other hand,
266 the combination of city and state has only 27435 distinct values.
268 It's advisable to create ndistinct statistics objects only on
269 combinations of columns that are actually used for grouping, and for
270 which misestimation of the number of groups is resulting in bad plans.
271 Otherwise, the ANALYZE cycles are just wasted.
273 14.2.2.3. Multivariate MCV Lists #
275 Another type of statistic stored for each column are most-common value
276 lists. This allows very accurate estimates for individual columns, but
277 may result in significant misestimates for queries with conditions on
280 To improve such estimates, ANALYZE can collect MCV lists on
281 combinations of columns. Similarly to functional dependencies and
282 n-distinct coefficients, it's impractical to do this for every possible
283 column grouping. Even more so in this case, as the MCV list (unlike
284 functional dependencies and n-distinct coefficients) does store the
285 common column values. So data is collected only for those groups of
286 columns appearing together in a statistics object defined with the mcv
289 Continuing the previous example, the MCV list for a table of ZIP codes
290 might look like the following (unlike for simpler types of statistics,
291 a function is required for inspection of MCV contents):
292 CREATE STATISTICS stts3 (mcv) ON city, state FROM zipcodes;
296 SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
297 pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts3';
299 index | values | nulls | frequency | base_frequency
300 -------+------------------------+-------+-----------+----------------
301 0 | {Washington, DC} | {f,f} | 0.003467 | 2.7e-05
302 1 | {Apo, AE} | {f,f} | 0.003067 | 1.9e-05
303 2 | {Houston, TX} | {f,f} | 0.002167 | 0.000133
304 3 | {El Paso, TX} | {f,f} | 0.002 | 0.000113
305 4 | {New York, NY} | {f,f} | 0.001967 | 0.000114
306 5 | {Atlanta, GA} | {f,f} | 0.001633 | 3.3e-05
307 6 | {Sacramento, CA} | {f,f} | 0.001433 | 7.8e-05
308 7 | {Miami, FL} | {f,f} | 0.0014 | 6e-05
309 8 | {Dallas, TX} | {f,f} | 0.001367 | 8.8e-05
310 9 | {Chicago, IL} | {f,f} | 0.001333 | 5.1e-05
314 This indicates that the most common combination of city and state is
315 Washington in DC, with actual frequency (in the sample) about 0.35%.
316 The base frequency of the combination (as computed from the simple
317 per-column frequencies) is only 0.0027%, resulting in two orders of
318 magnitude under-estimates.
320 It's advisable to create MCV statistics objects only on combinations of
321 columns that are actually used in conditions together, and for which
322 misestimation of the number of groups is resulting in bad plans.
323 Otherwise, the ANALYZE and planning cycles are just wasted.