begriffs open source - ai-pg/blob - full-docs/src/sgml/html/sql-createstatistics.html

   1 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
   2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>CREATE STATISTICS</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="sql-createserver.html" title="CREATE SERVER" /><link rel="next" href="sql-createsubscription.html" title="CREATE SUBSCRIPTION" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">CREATE STATISTICS</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="sql-createserver.html" title="CREATE SERVER">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="sql-commands.html" title="SQL Commands">Up</a></td><th width="60%" align="center">SQL Commands</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="sql-createsubscription.html" title="CREATE SUBSCRIPTION">Next</a></td></tr></table><hr /></div><div class="refentry" id="SQL-CREATESTATISTICS"><div class="titlepage"></div><a id="id-1.9.3.83.1" class="indexterm"></a><div class="refnamediv"><h2><span class="refentrytitle">CREATE STATISTICS</span></h2><p>CREATE STATISTICS — define extended statistics</p></div><div class="refsynopsisdiv"><h2>Synopsis</h2><pre class="synopsis">
   3 CREATE STATISTICS [ [ IF NOT EXISTS ] <em class="replaceable"><code>statistics_name</code></em> ]
   4     ON ( <em class="replaceable"><code>expression</code></em> )
   5     FROM <em class="replaceable"><code>table_name</code></em>
   6
   7 CREATE STATISTICS [ [ IF NOT EXISTS ] <em class="replaceable"><code>statistics_name</code></em> ]
   8     [ ( <em class="replaceable"><code>statistics_kind</code></em> [, ... ] ) ]
   9     ON { <em class="replaceable"><code>column_name</code></em> | ( <em class="replaceable"><code>expression</code></em> ) }, { <em class="replaceable"><code>column_name</code></em> | ( <em class="replaceable"><code>expression</code></em> ) } [, ...]
  10     FROM <em class="replaceable"><code>table_name</code></em>
  11 </pre></div><div class="refsect1" id="SQL-CREATESTATISTICS-DESCRIPTION"><h2>Description</h2><p>
  12    <code class="command">CREATE STATISTICS</code> will create a new extended statistics
  13    object tracking data about the specified table, foreign table or
  14    materialized view.  The statistics object will be created in the current
  15    database and will be owned by the user issuing the command.
  16   </p><p>
  17    The <code class="command">CREATE STATISTICS</code> command has two basic forms. The
  18    first form allows univariate statistics for a single expression to be
  19    collected, providing benefits similar to an expression index without the
  20    overhead of index maintenance.  This form does not allow the statistics
  21    kind to be specified, since the various statistics kinds refer only to
  22    multivariate statistics.  The second form of the command allows
  23    multivariate statistics on multiple columns and/or expressions to be
  24    collected, optionally specifying which statistics kinds to include.  This
  25    form will also automatically cause univariate statistics to be collected on
  26    any expressions included in the list.
  27   </p><p>
  28    If a schema name is given (for example, <code class="literal">CREATE STATISTICS
  29    myschema.mystat ...</code>) then the statistics object is created in the
  30    specified schema.  Otherwise it is created in the current schema.
  31    If given, the name of the statistics object must be distinct from the name
  32    of any other statistics object in the same schema.
  33   </p></div><div class="refsect1" id="id-1.9.3.83.6"><h2>Parameters</h2><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="literal">IF NOT EXISTS</code></span></dt><dd><p>
  34       Do not throw an error if a statistics object with the same name already
  35       exists.  A notice is issued in this case.  Note that only the name of
  36       the statistics object is considered here, not the details of its
  37       definition.
  38       Statistics name is required when <code class="literal">IF NOT EXISTS</code> is specified.
  39      </p></dd><dt><span class="term"><em class="replaceable"><code>statistics_name</code></em></span></dt><dd><p>
  40       The name (optionally schema-qualified) of the statistics object to be
  41       created.
  42       If the name is omitted, <span class="productname">PostgreSQL</span> chooses a
  43       suitable name based on the parent table's name and the defined column
  44       name(s) and/or expression(s).
  45      </p></dd><dt><span class="term"><em class="replaceable"><code>statistics_kind</code></em></span></dt><dd><p>
  46       A multivariate statistics kind to be computed in this statistics object.
  47       Currently supported kinds are
  48       <code class="literal">ndistinct</code>, which enables n-distinct statistics,
  49       <code class="literal">dependencies</code>, which enables functional
  50       dependency statistics, and <code class="literal">mcv</code> which enables
  51       most-common values lists.
  52       If this clause is omitted, all supported statistics kinds are
  53       included in the statistics object. Univariate expression statistics are
  54       built automatically if the statistics definition includes any complex
  55       expressions rather than just simple column references.
  56       For more information, see <a class="xref" href="planner-stats.html#PLANNER-STATS-EXTENDED" title="14.2.2. Extended Statistics">Section 14.2.2</a>
  57       and <a class="xref" href="multivariate-statistics-examples.html" title="69.2. Multivariate Statistics Examples">Section 69.2</a>.
  58      </p></dd><dt><span class="term"><em class="replaceable"><code>column_name</code></em></span></dt><dd><p>
  59       The name of a table column to be covered by the computed statistics.
  60       This is only allowed when building multivariate statistics.  At least
  61       two column names or expressions must be specified, and their order is
  62       not significant.
  63      </p></dd><dt><span class="term"><em class="replaceable"><code>expression</code></em></span></dt><dd><p>
  64       An expression to be covered by the computed statistics.  This may be
  65       used to build univariate statistics on a single expression, or as part
  66       of a list of multiple column names and/or expressions to build
  67       multivariate statistics.  In the latter case, separate univariate
  68       statistics are built automatically for each expression in the list.
  69      </p></dd><dt><span class="term"><em class="replaceable"><code>table_name</code></em></span></dt><dd><p>
  70       The name (optionally schema-qualified) of the table containing the
  71       column(s) the statistics are computed on;  see <a class="xref" href="sql-analyze.html" title="ANALYZE"><span class="refentrytitle">ANALYZE</span></a> for an explanation of the handling of
  72       inheritance and partitions.
  73      </p></dd></dl></div></div><div class="refsect1" id="id-1.9.3.83.7"><h2>Notes</h2><p>
  74    You must be the owner of a table to create a statistics object
  75    reading it.  Once created, however, the ownership of the statistics
  76    object is independent of the underlying table(s).
  77   </p><p>
  78    Expression statistics are per-expression and are similar to creating an
  79    index on the expression, except that they avoid the overhead of index
  80    maintenance. Expression statistics are built automatically for each
  81    expression in the statistics object definition.
  82   </p><p>
  83    Extended statistics are not currently used by the planner for selectivity
  84    estimations made for table joins.  This limitation will likely be removed
  85    in a future version of <span class="productname">PostgreSQL</span>.
  86   </p></div><div class="refsect1" id="SQL-CREATESTATISTICS-EXAMPLES"><h2>Examples</h2><p>
  87    Create table <code class="structname">t1</code> with two functionally dependent columns, i.e.,
  88    knowledge of a value in the first column is sufficient for determining the
  89    value in the other column. Then functional dependency statistics are built
  90    on those columns:
  91
  92 </p><pre class="programlisting">
  93 CREATE TABLE t1 (
  94     a   int,
  95     b   int
  96 );
  97
  98 INSERT INTO t1 SELECT i/100, i/500
  99                  FROM generate_series(1,1000000) s(i);
 100
 101 ANALYZE t1;
 102
 103 -- the number of matching rows will be drastically underestimated:
 104 EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
 105
 106 CREATE STATISTICS s1 (dependencies) ON a, b FROM t1;
 107
 108 ANALYZE t1;
 109
 110 -- now the row count estimate is more accurate:
 111 EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
 112 </pre><p>
 113
 114    Without functional-dependency statistics, the planner would assume
 115    that the two <code class="literal">WHERE</code> conditions are independent, and would
 116    multiply their selectivities together to arrive at a much-too-small
 117    row count estimate.
 118    With such statistics, the planner recognizes that the <code class="literal">WHERE</code>
 119    conditions are redundant and does not underestimate the row count.
 120   </p><p>
 121    Create table <code class="structname">t2</code> with two perfectly correlated columns
 122    (containing identical data), and an MCV list on those columns:
 123
 124 </p><pre class="programlisting">
 125 CREATE TABLE t2 (
 126     a   int,
 127     b   int
 128 );
 129
 130 INSERT INTO t2 SELECT mod(i,100), mod(i,100)
 131                  FROM generate_series(1,1000000) s(i);
 132
 133 CREATE STATISTICS s2 (mcv) ON a, b FROM t2;
 134
 135 ANALYZE t2;
 136
 137 -- valid combination (found in MCV)
 138 EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
 139
 140 -- invalid combination (not found in MCV)
 141 EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 2);
 142 </pre><p>
 143
 144    The MCV list gives the planner more detailed information about the
 145    specific values that commonly appear in the table, as well as an upper
 146    bound on the selectivities of combinations of values that do not appear
 147    in the table, allowing it to generate better estimates in both cases.
 148   </p><p>
 149    Create table <code class="structname">t3</code> with a single timestamp column,
 150    and run queries using expressions on that column.  Without extended
 151    statistics, the planner has no information about the data distribution for
 152    the expressions, and uses default estimates.  The planner also does not
 153    realize that the value of the date truncated to the month is fully
 154    determined by the value of the date truncated to the day. Then expression
 155    and ndistinct statistics are built on those two expressions:
 156
 157 </p><pre class="programlisting">
 158 CREATE TABLE t3 (
 159     a   timestamp
 160 );
 161
 162 INSERT INTO t3 SELECT i FROM generate_series('2020-01-01'::timestamp,
 163                                              '2020-12-31'::timestamp,
 164                                              '1 minute'::interval) s(i);
 165
 166 ANALYZE t3;
 167
 168 -- the number of matching rows will be drastically underestimated:
 169 EXPLAIN ANALYZE SELECT * FROM t3
 170   WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
 171
 172 EXPLAIN ANALYZE SELECT * FROM t3
 173   WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
 174                                  AND '2020-06-30'::timestamp;
 175
 176 EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
 177    FROM t3 GROUP BY 1, 2;
 178
 179 -- build ndistinct statistics on the pair of expressions (per-expression
 180 -- statistics are built automatically)
 181 CREATE STATISTICS s3 (ndistinct) ON date_trunc('month', a), date_trunc('day', a) FROM t3;
 182
 183 ANALYZE t3;
 184
 185 -- now the row count estimates are more accurate:
 186 EXPLAIN ANALYZE SELECT * FROM t3
 187   WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
 188
 189 EXPLAIN ANALYZE SELECT * FROM t3
 190   WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
 191                                  AND '2020-06-30'::timestamp;
 192
 193 EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
 194    FROM t3 GROUP BY 1, 2;
 195 </pre><p>
 196
 197    Without expression and ndistinct statistics, the planner has no information
 198    about the number of distinct values for the expressions, and has to rely
 199    on default estimates. The equality and range conditions are assumed to have
 200    0.5% selectivity, and the number of distinct values in the expression is
 201    assumed to be the same as for the column (i.e. unique). This results in a
 202    significant underestimate of the row count in the first two queries. Moreover,
 203    the planner has no information about the relationship between the expressions,
 204    so it assumes the two <code class="literal">WHERE</code> and <code class="literal">GROUP BY</code>
 205    conditions are independent, and multiplies their selectivities together to
 206    arrive at a severe overestimate of the group count in the aggregate query.
 207    This is further exacerbated by the lack of accurate statistics for the
 208    expressions, forcing the planner to use a default ndistinct estimate for the
 209    expression derived from ndistinct for the column. With such statistics, the
 210    planner recognizes that the conditions are correlated, and arrives at much
 211    more accurate estimates.
 212   </p></div><div class="refsect1" id="id-1.9.3.83.9"><h2>Compatibility</h2><p>
 213    There is no <code class="command">CREATE STATISTICS</code> command in the SQL standard.
 214   </p></div><div class="refsect1" id="id-1.9.3.83.10"><h2>See Also</h2><span class="simplelist"><a class="xref" href="sql-alterstatistics.html" title="ALTER STATISTICS"><span class="refentrytitle">ALTER STATISTICS</span></a>, <a class="xref" href="sql-dropstatistics.html" title="DROP STATISTICS"><span class="refentrytitle">DROP STATISTICS</span></a></span></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="sql-createserver.html" title="CREATE SERVER">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="sql-commands.html" title="SQL Commands">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="sql-createsubscription.html" title="CREATE SUBSCRIPTION">Next</a></td></tr><tr><td width="40%" align="left" valign="top">CREATE SERVER </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="40%" align="right" valign="top"> CREATE SUBSCRIPTION</td></tr></table></div></body></html>