4 CREATE STATISTICS — define extended statistics
8 CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ]
12 CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ]
13 [ ( statistics_kind [, ... ] ) ]
14 ON { column_name | ( expression ) }, { column_name | ( expression ) } [, ...
20 CREATE STATISTICS will create a new extended statistics object tracking
21 data about the specified table, foreign table or materialized view. The
22 statistics object will be created in the current database and will be
23 owned by the user issuing the command.
25 The CREATE STATISTICS command has two basic forms. The first form
26 allows univariate statistics for a single expression to be collected,
27 providing benefits similar to an expression index without the overhead
28 of index maintenance. This form does not allow the statistics kind to
29 be specified, since the various statistics kinds refer only to
30 multivariate statistics. The second form of the command allows
31 multivariate statistics on multiple columns and/or expressions to be
32 collected, optionally specifying which statistics kinds to include.
33 This form will also automatically cause univariate statistics to be
34 collected on any expressions included in the list.
36 If a schema name is given (for example, CREATE STATISTICS
37 myschema.mystat ...) then the statistics object is created in the
38 specified schema. Otherwise it is created in the current schema. If
39 given, the name of the statistics object must be distinct from the name
40 of any other statistics object in the same schema.
45 Do not throw an error if a statistics object with the same name
46 already exists. A notice is issued in this case. Note that only
47 the name of the statistics object is considered here, not the
48 details of its definition. Statistics name is required when IF
49 NOT EXISTS is specified.
52 The name (optionally schema-qualified) of the statistics object
53 to be created. If the name is omitted, PostgreSQL chooses a
54 suitable name based on the parent table's name and the defined
55 column name(s) and/or expression(s).
58 A multivariate statistics kind to be computed in this statistics
59 object. Currently supported kinds are ndistinct, which enables
60 n-distinct statistics, dependencies, which enables functional
61 dependency statistics, and mcv which enables most-common values
62 lists. If this clause is omitted, all supported statistics kinds
63 are included in the statistics object. Univariate expression
64 statistics are built automatically if the statistics definition
65 includes any complex expressions rather than just simple column
66 references. For more information, see Section 14.2.2 and
70 The name of a table column to be covered by the computed
71 statistics. This is only allowed when building multivariate
72 statistics. At least two column names or expressions must be
73 specified, and their order is not significant.
76 An expression to be covered by the computed statistics. This may
77 be used to build univariate statistics on a single expression,
78 or as part of a list of multiple column names and/or expressions
79 to build multivariate statistics. In the latter case, separate
80 univariate statistics are built automatically for each
81 expression in the list.
84 The name (optionally schema-qualified) of the table containing
85 the column(s) the statistics are computed on; see ANALYZE for an
86 explanation of the handling of inheritance and partitions.
90 You must be the owner of a table to create a statistics object reading
91 it. Once created, however, the ownership of the statistics object is
92 independent of the underlying table(s).
94 Expression statistics are per-expression and are similar to creating an
95 index on the expression, except that they avoid the overhead of index
96 maintenance. Expression statistics are built automatically for each
97 expression in the statistics object definition.
99 Extended statistics are not currently used by the planner for
100 selectivity estimations made for table joins. This limitation will
101 likely be removed in a future version of PostgreSQL.
105 Create table t1 with two functionally dependent columns, i.e.,
106 knowledge of a value in the first column is sufficient for determining
107 the value in the other column. Then functional dependency statistics
108 are built on those columns:
114 INSERT INTO t1 SELECT i/100, i/500
115 FROM generate_series(1,1000000) s(i);
119 -- the number of matching rows will be drastically underestimated:
120 EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
122 CREATE STATISTICS s1 (dependencies) ON a, b FROM t1;
126 -- now the row count estimate is more accurate:
127 EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
129 Without functional-dependency statistics, the planner would assume that
130 the two WHERE conditions are independent, and would multiply their
131 selectivities together to arrive at a much-too-small row count
132 estimate. With such statistics, the planner recognizes that the WHERE
133 conditions are redundant and does not underestimate the row count.
135 Create table t2 with two perfectly correlated columns (containing
136 identical data), and an MCV list on those columns:
142 INSERT INTO t2 SELECT mod(i,100), mod(i,100)
143 FROM generate_series(1,1000000) s(i);
145 CREATE STATISTICS s2 (mcv) ON a, b FROM t2;
149 -- valid combination (found in MCV)
150 EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
152 -- invalid combination (not found in MCV)
153 EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 2);
155 The MCV list gives the planner more detailed information about the
156 specific values that commonly appear in the table, as well as an upper
157 bound on the selectivities of combinations of values that do not appear
158 in the table, allowing it to generate better estimates in both cases.
160 Create table t3 with a single timestamp column, and run queries using
161 expressions on that column. Without extended statistics, the planner
162 has no information about the data distribution for the expressions, and
163 uses default estimates. The planner also does not realize that the
164 value of the date truncated to the month is fully determined by the
165 value of the date truncated to the day. Then expression and ndistinct
166 statistics are built on those two expressions:
171 INSERT INTO t3 SELECT i FROM generate_series('2020-01-01'::timestamp,
172 '2020-12-31'::timestamp,
173 '1 minute'::interval) s(i);
177 -- the number of matching rows will be drastically underestimated:
178 EXPLAIN ANALYZE SELECT * FROM t3
179 WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
181 EXPLAIN ANALYZE SELECT * FROM t3
182 WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
183 AND '2020-06-30'::timestamp;
185 EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
186 FROM t3 GROUP BY 1, 2;
188 -- build ndistinct statistics on the pair of expressions (per-expression
189 -- statistics are built automatically)
190 CREATE STATISTICS s3 (ndistinct) ON date_trunc('month', a), date_trunc('day', a)
195 -- now the row count estimates are more accurate:
196 EXPLAIN ANALYZE SELECT * FROM t3
197 WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
199 EXPLAIN ANALYZE SELECT * FROM t3
200 WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
201 AND '2020-06-30'::timestamp;
203 EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
204 FROM t3 GROUP BY 1, 2;
206 Without expression and ndistinct statistics, the planner has no
207 information about the number of distinct values for the expressions,
208 and has to rely on default estimates. The equality and range conditions
209 are assumed to have 0.5% selectivity, and the number of distinct values
210 in the expression is assumed to be the same as for the column (i.e.
211 unique). This results in a significant underestimate of the row count
212 in the first two queries. Moreover, the planner has no information
213 about the relationship between the expressions, so it assumes the two
214 WHERE and GROUP BY conditions are independent, and multiplies their
215 selectivities together to arrive at a severe overestimate of the group
216 count in the aggregate query. This is further exacerbated by the lack
217 of accurate statistics for the expressions, forcing the planner to use
218 a default ndistinct estimate for the expression derived from ndistinct
219 for the column. With such statistics, the planner recognizes that the
220 conditions are correlated, and arrives at much more accurate estimates.
224 There is no CREATE STATISTICS command in the SQL standard.
228 ALTER STATISTICS, DROP STATISTICS