begriffs open source - ai-pg/blob - full-docs/html/planner-stats.html

   1 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
   2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>14.2. Statistics Used by the Planner</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="using-explain.html" title="14.1. Using EXPLAIN" /><link rel="next" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">14.2. Statistics Used by the Planner</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="using-explain.html" title="14.1. Using EXPLAIN">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="performance-tips.html" title="Chapter 14. Performance Tips">Up</a></td><th width="60%" align="center">Chapter 14. Performance Tips</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses">Next</a></td></tr></table><hr /></div><div class="sect1" id="PLANNER-STATS"><div class="titlepage"><div><div><h2 class="title" style="clear: both">14.2. Statistics Used by the Planner <a href="#PLANNER-STATS" class="id_link">#</a></h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="planner-stats.html#PLANNER-STATS-SINGLE-COLUMN">14.2.1. Single-Column Statistics</a></span></dt><dt><span class="sect2"><a href="planner-stats.html#PLANNER-STATS-EXTENDED">14.2.2. Extended Statistics</a></span></dt></dl></div><a id="id-1.5.13.5.2" class="indexterm"></a><div class="sect2" id="PLANNER-STATS-SINGLE-COLUMN"><div class="titlepage"><div><div><h3 class="title">14.2.1. Single-Column Statistics <a href="#PLANNER-STATS-SINGLE-COLUMN" class="id_link">#</a></h3></div></div></div><p>
   3    As we saw in the previous section, the query planner needs to estimate
   4    the number of rows retrieved by a query in order to make good choices
   5    of query plans.  This section provides a quick look at the statistics
   6    that the system uses for these estimates.
   7   </p><p>
   8    One component of the statistics is the total number of entries in
   9    each table and index, as well as the number of disk blocks occupied
  10    by each table and index.  This information is kept in the table
  11    <a class="link" href="catalog-pg-class.html" title="52.11. pg_class"><code class="structname">pg_class</code></a>,
  12    in the columns <code class="structfield">reltuples</code> and
  13    <code class="structfield">relpages</code>.  We can look at it with
  14    queries similar to this one:
  15
  16 </p><pre class="screen">
  17 SELECT relname, relkind, reltuples, relpages
  18 FROM pg_class
  19 WHERE relname LIKE 'tenk1%';
  20
  21        relname        | relkind | reltuples | relpages
  22 ----------------------+---------+-----------+----------
  23  tenk1                | r       |     10000 |      345
  24  tenk1_hundred        | i       |     10000 |       11
  25  tenk1_thous_tenthous | i       |     10000 |       30
  26  tenk1_unique1        | i       |     10000 |       30
  27  tenk1_unique2        | i       |     10000 |       30
  28 (5 rows)
  29 </pre><p>
  30
  31    Here we can see that <code class="structname">tenk1</code> contains 10000
  32    rows, as do its indexes, but the indexes are (unsurprisingly) much
  33    smaller than the table.
  34   </p><p>
  35    For efficiency reasons, <code class="structfield">reltuples</code>
  36    and <code class="structfield">relpages</code> are not updated on-the-fly,
  37    and so they usually contain somewhat out-of-date values.
  38    They are updated by <code class="command">VACUUM</code>, <code class="command">ANALYZE</code>, and a
  39    few DDL commands such as <code class="command">CREATE INDEX</code>.  A <code class="command">VACUUM</code>
  40    or <code class="command">ANALYZE</code> operation that does not scan the entire table
  41    (which is commonly the case) will incrementally update the
  42    <code class="structfield">reltuples</code> count on the basis of the part
  43    of the table it did scan, resulting in an approximate value.
  44    In any case, the planner
  45    will scale the values it finds in <code class="structname">pg_class</code>
  46    to match the current physical table size, thus obtaining a closer
  47    approximation.
  48   </p><a id="id-1.5.13.5.3.5" class="indexterm"></a><p>
  49    Most queries retrieve only a fraction of the rows in a table, due
  50    to <code class="literal">WHERE</code> clauses that restrict the rows to be
  51    examined.  The planner thus needs to make an estimate of the
  52    <em class="firstterm">selectivity</em> of <code class="literal">WHERE</code> clauses, that is,
  53    the fraction of rows that match each condition in the
  54    <code class="literal">WHERE</code> clause.  The information used for this task is
  55    stored in the
  56    <a class="link" href="catalog-pg-statistic.html" title="52.51. pg_statistic"><code class="structname">pg_statistic</code></a>
  57    system catalog.  Entries in <code class="structname">pg_statistic</code>
  58    are updated by the <code class="command">ANALYZE</code> and <code class="command">VACUUM
  59    ANALYZE</code> commands, and are always approximate even when freshly
  60    updated.
  61   </p><a id="id-1.5.13.5.3.7" class="indexterm"></a><p>
  62    Rather than look at <code class="structname">pg_statistic</code> directly,
  63    it's better to look at its view
  64    <a class="link" href="view-pg-stats.html" title="53.29. pg_stats"><code class="structname">pg_stats</code></a>
  65    when examining the statistics manually.  <code class="structname">pg_stats</code>
  66    is designed to be more easily readable.  Furthermore,
  67    <code class="structname">pg_stats</code> is readable by all, whereas
  68    <code class="structname">pg_statistic</code> is only readable by a superuser.
  69    (This prevents unprivileged users from learning something about
  70    the contents of other people's tables from the statistics.  The
  71    <code class="structname">pg_stats</code> view is restricted to show only
  72    rows about tables that the current user can read.)
  73    For example, we might do:
  74
  75 </p><pre class="screen">
  76 SELECT attname, inherited, n_distinct,
  77        array_to_string(most_common_vals, E'\n') as most_common_vals
  78 FROM pg_stats
  79 WHERE tablename = 'road';
  80
  81  attname | inherited | n_distinct |          most_common_vals
  82 ---------+-----------+------------+------------------------------------
  83  name    | f         | -0.5681108 | I- 580                        Ramp+
  84          |           |            | I- 880                        Ramp+
  85          |           |            | Sp Railroad                       +
  86          |           |            | I- 580                            +
  87          |           |            | I- 680                        Ramp+
  88          |           |            | I- 80                         Ramp+
  89          |           |            | 14th                          St  +
  90          |           |            | I- 880                            +
  91          |           |            | Mac Arthur                    Blvd+
  92          |           |            | Mission                       Blvd+
  93 ...
  94  name    | t         |    -0.5125 | I- 580                        Ramp+
  95          |           |            | I- 880                        Ramp+
  96          |           |            | I- 580                            +
  97          |           |            | I- 680                        Ramp+
  98          |           |            | I- 80                         Ramp+
  99          |           |            | Sp Railroad                       +
 100          |           |            | I- 880                            +
 101          |           |            | State Hwy 13                  Ramp+
 102          |           |            | I- 80                             +
 103          |           |            | State Hwy 24                  Ramp+
 104 ...
 105  thepath | f         |          0 |
 106  thepath | t         |          0 |
 107 (4 rows)
 108 </pre><p>
 109
 110    Note that two rows are displayed for the same column, one corresponding
 111    to the complete inheritance hierarchy starting at the
 112    <code class="literal">road</code> table (<code class="literal">inherited</code>=<code class="literal">t</code>),
 113    and another one including only the <code class="literal">road</code> table itself
 114    (<code class="literal">inherited</code>=<code class="literal">f</code>).
 115    (For brevity, we have only shown the first ten most-common values for
 116    the <code class="literal">name</code> column.)
 117   </p><p>
 118    The amount of information stored in <code class="structname">pg_statistic</code>
 119    by <code class="command">ANALYZE</code>, in particular the maximum number of entries in the
 120    <code class="structfield">most_common_vals</code> and <code class="structfield">histogram_bounds</code>
 121    arrays for each column, can be set on a
 122    column-by-column basis using the <code class="command">ALTER TABLE SET STATISTICS</code>
 123    command, or globally by setting the
 124    <a class="xref" href="runtime-config-query.html#GUC-DEFAULT-STATISTICS-TARGET">default_statistics_target</a> configuration variable.
 125    The default limit is presently 100 entries.  Raising the limit
 126    might allow more accurate planner estimates to be made, particularly for
 127    columns with irregular data distributions, at the price of consuming
 128    more space in <code class="structname">pg_statistic</code> and slightly more
 129    time to compute the estimates.  Conversely, a lower limit might be
 130    sufficient for columns with simple data distributions.
 131   </p><p>
 132    Further details about the planner's use of statistics can be found in
 133    <a class="xref" href="planner-stats-details.html" title="Chapter 69. How the Planner Uses Statistics">Chapter 69</a>.
 134   </p></div><div class="sect2" id="PLANNER-STATS-EXTENDED"><div class="titlepage"><div><div><h3 class="title">14.2.2. Extended Statistics <a href="#PLANNER-STATS-EXTENDED" class="id_link">#</a></h3></div></div></div><a id="id-1.5.13.5.4.2" class="indexterm"></a><a id="id-1.5.13.5.4.3" class="indexterm"></a><a id="id-1.5.13.5.4.4" class="indexterm"></a><a id="id-1.5.13.5.4.5" class="indexterm"></a><p>
 135     It is common to see slow queries running bad execution plans because
 136     multiple columns used in the query clauses are correlated.
 137     The planner normally assumes that multiple conditions
 138     are independent of each other,
 139     an assumption that does not hold when column values are correlated.
 140     Regular statistics, because of their per-individual-column nature,
 141     cannot capture any knowledge about cross-column correlation.
 142     However, <span class="productname">PostgreSQL</span> has the ability to compute
 143     <em class="firstterm">multivariate statistics</em>, which can capture
 144     such information.
 145    </p><p>
 146     Because the number of possible column combinations is very large,
 147     it's impractical to compute multivariate statistics automatically.
 148     Instead, <em class="firstterm">extended statistics objects</em>, more often
 149     called just <em class="firstterm">statistics objects</em>, can be created to instruct
 150     the server to obtain statistics across interesting sets of columns.
 151    </p><p>
 152     Statistics objects are created using the
 153     <a class="link" href="sql-createstatistics.html" title="CREATE STATISTICS"><code class="command">CREATE STATISTICS</code></a> command.
 154     Creation of such an object merely creates a catalog entry expressing
 155     interest in the statistics.  Actual data collection is performed
 156     by <code class="command">ANALYZE</code> (either a manual command, or background
 157     auto-analyze).  The collected values can be examined in the
 158     <a class="link" href="catalog-pg-statistic-ext-data.html" title="52.53. pg_statistic_ext_data"><code class="structname">pg_statistic_ext_data</code></a>
 159     catalog.
 160    </p><p>
 161     <code class="command">ANALYZE</code> computes extended statistics based on the same
 162     sample of table rows that it takes for computing regular single-column
 163     statistics.  Since the sample size is increased by increasing the
 164     statistics target for the table or any of its columns (as described in
 165     the previous section), a larger statistics target will normally result in
 166     more accurate extended statistics, as well as more time spent calculating
 167     them.
 168    </p><p>
 169     The following subsections describe the kinds of extended statistics
 170     that are currently supported.
 171    </p><div class="sect3" id="PLANNER-STATS-EXTENDED-FUNCTIONAL-DEPS"><div class="titlepage"><div><div><h4 class="title">14.2.2.1. Functional Dependencies <a href="#PLANNER-STATS-EXTENDED-FUNCTIONAL-DEPS" class="id_link">#</a></h4></div></div></div><p>
 172      The simplest kind of extended statistics tracks <em class="firstterm">functional
 173      dependencies</em>, a concept used in definitions of database normal forms.
 174      We say that column <code class="structfield">b</code> is functionally dependent on
 175      column <code class="structfield">a</code> if knowledge of the value of
 176      <code class="structfield">a</code> is sufficient to determine the value
 177      of <code class="structfield">b</code>, that is there are no two rows having the same value
 178      of <code class="structfield">a</code> but different values of <code class="structfield">b</code>.
 179      In a fully normalized database, functional dependencies should exist
 180      only on primary keys and superkeys. However, in practice many data sets
 181      are not fully normalized for various reasons; intentional
 182      denormalization for performance reasons is a common example.
 183      Even in a fully normalized database, there may be partial correlation
 184      between some columns, which can be expressed as partial functional
 185      dependency.
 186     </p><p>
 187      The existence of functional dependencies directly affects the accuracy
 188      of estimates in certain queries.  If a query contains conditions on
 189      both the independent and the dependent column(s), the
 190      conditions on the dependent columns do not further reduce the result
 191      size; but without knowledge of the functional dependency, the query
 192      planner will assume that the conditions are independent, resulting
 193      in underestimating the result size.
 194     </p><p>
 195      To inform the planner about functional dependencies, <code class="command">ANALYZE</code>
 196      can collect measurements of cross-column dependency. Assessing the
 197      degree of dependency between all sets of columns would be prohibitively
 198      expensive, so data collection is limited to those groups of columns
 199      appearing together in a statistics object defined with
 200      the <code class="literal">dependencies</code> option.  It is advisable to create
 201      <code class="literal">dependencies</code> statistics only for column groups that are
 202      strongly correlated, to avoid unnecessary overhead in both
 203      <code class="command">ANALYZE</code> and later query planning.
 204     </p><p>
 205      Here is an example of collecting functional-dependency statistics:
 206 </p><pre class="programlisting">
 207 CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes;
 208
 209 ANALYZE zipcodes;
 210
 211 SELECT stxname, stxkeys, stxddependencies
 212   FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
 213   WHERE stxname = 'stts';
 214  stxname | stxkeys |             stxddependencies
 215 ---------+---------+------------------------------------------
 216  stts    | 1 5     | {"1 =&gt; 5": 1.000000, "5 =&gt; 1": 0.423130}
 217 (1 row)
 218 </pre><p>
 219      Here it can be seen that column 1 (zip code) fully determines column
 220      5 (city) so the coefficient is 1.0, while city only determines zip code
 221      about 42% of the time, meaning that there are many cities (58%) that are
 222      represented by more than a single ZIP code.
 223     </p><p>
 224      When computing the selectivity for a query involving functionally
 225      dependent columns, the planner adjusts the per-condition selectivity
 226      estimates using the dependency coefficients so as not to produce
 227      an underestimate.
 228     </p><div class="sect4" id="PLANNER-STATS-EXTENDED-FUNCTIONAL-DEPS-LIMITS"><div class="titlepage"><div><div><h5 class="title">14.2.2.1.1. Limitations of Functional Dependencies <a href="#PLANNER-STATS-EXTENDED-FUNCTIONAL-DEPS-LIMITS" class="id_link">#</a></h5></div></div></div><p>
 229       Functional dependencies are currently only applied when considering
 230       simple equality conditions that compare columns to constant values,
 231       and <code class="literal">IN</code> clauses with constant values.
 232       They are not used to improve estimates for equality conditions
 233       comparing two columns or comparing a column to an expression, nor for
 234       range clauses, <code class="literal">LIKE</code> or any other type of condition.
 235      </p><p>
 236       When estimating with functional dependencies, the planner assumes that
 237       conditions on the involved columns are compatible and hence redundant.
 238       If they are incompatible, the correct estimate would be zero rows, but
 239       that possibility is not considered.  For example, given a query like
 240 </p><pre class="programlisting">
 241 SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105';
 242 </pre><p>
 243       the planner will disregard the <code class="structfield">city</code> clause as not
 244       changing the selectivity, which is correct.  However, it will make
 245       the same assumption about
 246 </p><pre class="programlisting">
 247 SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210';
 248 </pre><p>
 249       even though there will really be zero rows satisfying this query.
 250       Functional dependency statistics do not provide enough information
 251       to conclude that, however.
 252      </p><p>
 253       In many practical situations, this assumption is usually satisfied;
 254       for example, there might be a GUI in the application that only allows
 255       selecting compatible city and ZIP code values to use in a query.
 256       But if that's not the case, functional dependencies may not be a viable
 257       option.
 258      </p></div></div><div class="sect3" id="PLANNER-STATS-EXTENDED-N-DISTINCT-COUNTS"><div class="titlepage"><div><div><h4 class="title">14.2.2.2. Multivariate N-Distinct Counts <a href="#PLANNER-STATS-EXTENDED-N-DISTINCT-COUNTS" class="id_link">#</a></h4></div></div></div><p>
 259      Single-column statistics store the number of distinct values in each
 260      column.  Estimates of the number of distinct values when combining more
 261      than one column (for example, for <code class="literal">GROUP BY a, b</code>) are
 262      frequently wrong when the planner only has single-column statistical
 263      data, causing it to select bad plans.
 264     </p><p>
 265      To improve such estimates, <code class="command">ANALYZE</code> can collect n-distinct
 266      statistics for groups of columns.  As before, it's impractical to do
 267      this for every possible column grouping, so data is collected only for
 268      those groups of columns appearing together in a statistics object
 269      defined with the <code class="literal">ndistinct</code> option.  Data will be collected
 270      for each possible combination of two or more columns from the set of
 271      listed columns.
 272     </p><p>
 273      Continuing the previous example, the n-distinct counts in a
 274      table of ZIP codes might look like the following:
 275 </p><pre class="programlisting">
 276 CREATE STATISTICS stts2 (ndistinct) ON city, state, zip FROM zipcodes;
 277
 278 ANALYZE zipcodes;
 279
 280 SELECT stxkeys AS k, stxdndistinct AS nd
 281   FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
 282   WHERE stxname = 'stts2';
 283 -[ RECORD 1 ]--------------------------------------------------------
 284 k  | 1 2 5
 285 nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
 286 (1 row)
 287 </pre><p>
 288      This indicates that there are three combinations of columns that
 289      have 33178 distinct values: ZIP code and state; ZIP code and city;
 290      and ZIP code, city and state (the fact that they are all equal is
 291      expected given that ZIP code alone is unique in this table).  On the
 292      other hand, the combination of city and state has only 27435 distinct
 293      values.
 294     </p><p>
 295      It's advisable to create <code class="literal">ndistinct</code> statistics objects only
 296      on combinations of columns that are actually used for grouping, and
 297      for which misestimation of the number of groups is resulting in bad
 298      plans.  Otherwise, the <code class="command">ANALYZE</code> cycles are just wasted.
 299     </p></div><div class="sect3" id="PLANNER-STATS-EXTENDED-MCV-LISTS"><div class="titlepage"><div><div><h4 class="title">14.2.2.3. Multivariate MCV Lists <a href="#PLANNER-STATS-EXTENDED-MCV-LISTS" class="id_link">#</a></h4></div></div></div><p>
 300      Another type of statistic stored for each column are most-common value
 301      lists.  This allows very accurate estimates for individual columns, but
 302      may result in significant misestimates for queries with conditions on
 303      multiple columns.
 304     </p><p>
 305      To improve such estimates, <code class="command">ANALYZE</code> can collect MCV
 306      lists on combinations of columns.  Similarly to functional dependencies
 307      and n-distinct coefficients, it's impractical to do this for every
 308      possible column grouping.  Even more so in this case, as the MCV list
 309      (unlike functional dependencies and n-distinct coefficients) does store
 310      the common column values.  So data is collected only for those groups
 311      of columns appearing together in a statistics object defined with the
 312      <code class="literal">mcv</code> option.
 313     </p><p>
 314      Continuing the previous example, the MCV list for a table of ZIP codes
 315      might look like the following (unlike for simpler types of statistics,
 316      a function is required for inspection of MCV contents):
 317
 318 </p><pre class="programlisting">
 319 CREATE STATISTICS stts3 (mcv) ON city, state FROM zipcodes;
 320
 321 ANALYZE zipcodes;
 322
 323 SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
 324                 pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts3';
 325
 326  index |         values         | nulls | frequency | base_frequency
 327 -------+------------------------+-------+-----------+----------------
 328      0 | {Washington, DC}       | {f,f} |  0.003467 |        2.7e-05
 329      1 | {Apo, AE}              | {f,f} |  0.003067 |        1.9e-05
 330      2 | {Houston, TX}          | {f,f} |  0.002167 |       0.000133
 331      3 | {El Paso, TX}          | {f,f} |     0.002 |       0.000113
 332      4 | {New York, NY}         | {f,f} |  0.001967 |       0.000114
 333      5 | {Atlanta, GA}          | {f,f} |  0.001633 |        3.3e-05
 334      6 | {Sacramento, CA}       | {f,f} |  0.001433 |        7.8e-05
 335      7 | {Miami, FL}            | {f,f} |    0.0014 |          6e-05
 336      8 | {Dallas, TX}           | {f,f} |  0.001367 |        8.8e-05
 337      9 | {Chicago, IL}          | {f,f} |  0.001333 |        5.1e-05
 338    ...
 339 (99 rows)
 340 </pre><p>
 341      This indicates that the most common combination of city and state is
 342      Washington in DC, with actual frequency (in the sample) about 0.35%.
 343      The base frequency of the combination (as computed from the simple
 344      per-column frequencies) is only 0.0027%, resulting in two orders of
 345      magnitude under-estimates.
 346     </p><p>
 347      It's advisable to create <acronym class="acronym">MCV</acronym> statistics objects only
 348      on combinations of columns that are actually used in conditions together,
 349      and for which misestimation of the number of groups is resulting in bad
 350      plans.  Otherwise, the <code class="command">ANALYZE</code> and planning cycles
 351      are just wasted.
 352     </p></div></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="using-explain.html" title="14.1. Using EXPLAIN">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="performance-tips.html" title="Chapter 14. Performance Tips">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses">Next</a></td></tr><tr><td width="40%" align="left" valign="top">14.1. Using <code class="command">EXPLAIN</code> </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 14.3. Controlling the Planner with Explicit <code class="literal">JOIN</code> Clauses</td></tr></table></div></body></html>