begriffs open source - ai-pg/blob - full-docs/txt/planner-stats.txt

   1
   2 14.2. Statistics Used by the Planner #
   3
   4    14.2.1. Single-Column Statistics
   5    14.2.2. Extended Statistics
   6
   7 14.2.1. Single-Column Statistics #
   8
   9    As we saw in the previous section, the query planner needs to estimate
  10    the number of rows retrieved by a query in order to make good choices
  11    of query plans. This section provides a quick look at the statistics
  12    that the system uses for these estimates.
  13
  14    One component of the statistics is the total number of entries in each
  15    table and index, as well as the number of disk blocks occupied by each
  16    table and index. This information is kept in the table pg_class, in the
  17    columns reltuples and relpages. We can look at it with queries similar
  18    to this one:
  19 SELECT relname, relkind, reltuples, relpages
  20 FROM pg_class
  21 WHERE relname LIKE 'tenk1%';
  22
  23        relname        | relkind | reltuples | relpages
  24 ----------------------+---------+-----------+----------
  25  tenk1                | r       |     10000 |      345
  26  tenk1_hundred        | i       |     10000 |       11
  27  tenk1_thous_tenthous | i       |     10000 |       30
  28  tenk1_unique1        | i       |     10000 |       30
  29  tenk1_unique2        | i       |     10000 |       30
  30 (5 rows)
  31
  32    Here we can see that tenk1 contains 10000 rows, as do its indexes, but
  33    the indexes are (unsurprisingly) much smaller than the table.
  34
  35    For efficiency reasons, reltuples and relpages are not updated
  36    on-the-fly, and so they usually contain somewhat out-of-date values.
  37    They are updated by VACUUM, ANALYZE, and a few DDL commands such as
  38    CREATE INDEX. A VACUUM or ANALYZE operation that does not scan the
  39    entire table (which is commonly the case) will incrementally update the
  40    reltuples count on the basis of the part of the table it did scan,
  41    resulting in an approximate value. In any case, the planner will scale
  42    the values it finds in pg_class to match the current physical table
  43    size, thus obtaining a closer approximation.
  44
  45    Most queries retrieve only a fraction of the rows in a table, due to
  46    WHERE clauses that restrict the rows to be examined. The planner thus
  47    needs to make an estimate of the selectivity of WHERE clauses, that is,
  48    the fraction of rows that match each condition in the WHERE clause. The
  49    information used for this task is stored in the pg_statistic system
  50    catalog. Entries in pg_statistic are updated by the ANALYZE and VACUUM
  51    ANALYZE commands, and are always approximate even when freshly updated.
  52
  53    Rather than look at pg_statistic directly, it's better to look at its
  54    view pg_stats when examining the statistics manually. pg_stats is
  55    designed to be more easily readable. Furthermore, pg_stats is readable
  56    by all, whereas pg_statistic is only readable by a superuser. (This
  57    prevents unprivileged users from learning something about the contents
  58    of other people's tables from the statistics. The pg_stats view is
  59    restricted to show only rows about tables that the current user can
  60    read.) For example, we might do:
  61 SELECT attname, inherited, n_distinct,
  62        array_to_string(most_common_vals, E'\n') as most_common_vals
  63 FROM pg_stats
  64 WHERE tablename = 'road';
  65
  66  attname | inherited | n_distinct |          most_common_vals
  67 ---------+-----------+------------+------------------------------------
  68  name    | f         | -0.5681108 | I- 580                        Ramp+
  69          |           |            | I- 880                        Ramp+
  70          |           |            | Sp Railroad                       +
  71          |           |            | I- 580                            +
  72          |           |            | I- 680                        Ramp+
  73          |           |            | I- 80                         Ramp+
  74          |           |            | 14th                          St  +
  75          |           |            | I- 880                            +
  76          |           |            | Mac Arthur                    Blvd+
  77          |           |            | Mission                       Blvd+
  78 ...
  79  name    | t         |    -0.5125 | I- 580                        Ramp+
  80          |           |            | I- 880                        Ramp+
  81          |           |            | I- 580                            +
  82          |           |            | I- 680                        Ramp+
  83          |           |            | I- 80                         Ramp+
  84          |           |            | Sp Railroad                       +
  85          |           |            | I- 880                            +
  86          |           |            | State Hwy 13                  Ramp+
  87          |           |            | I- 80                             +
  88          |           |            | State Hwy 24                  Ramp+
  89 ...
  90  thepath | f         |          0 |
  91  thepath | t         |          0 |
  92 (4 rows)
  93
  94    Note that two rows are displayed for the same column, one corresponding
  95    to the complete inheritance hierarchy starting at the road table
  96    (inherited=t), and another one including only the road table itself
  97    (inherited=f). (For brevity, we have only shown the first ten
  98    most-common values for the name column.)
  99
 100    The amount of information stored in pg_statistic by ANALYZE, in
 101    particular the maximum number of entries in the most_common_vals and
 102    histogram_bounds arrays for each column, can be set on a
 103    column-by-column basis using the ALTER TABLE SET STATISTICS command, or
 104    globally by setting the default_statistics_target configuration
 105    variable. The default limit is presently 100 entries. Raising the limit
 106    might allow more accurate planner estimates to be made, particularly
 107    for columns with irregular data distributions, at the price of
 108    consuming more space in pg_statistic and slightly more time to compute
 109    the estimates. Conversely, a lower limit might be sufficient for
 110    columns with simple data distributions.
 111
 112    Further details about the planner's use of statistics can be found in
 113    Chapter 69.
 114
 115 14.2.2. Extended Statistics #
 116
 117    It is common to see slow queries running bad execution plans because
 118    multiple columns used in the query clauses are correlated. The planner
 119    normally assumes that multiple conditions are independent of each
 120    other, an assumption that does not hold when column values are
 121    correlated. Regular statistics, because of their per-individual-column
 122    nature, cannot capture any knowledge about cross-column correlation.
 123    However, PostgreSQL has the ability to compute multivariate statistics,
 124    which can capture such information.
 125
 126    Because the number of possible column combinations is very large, it's
 127    impractical to compute multivariate statistics automatically. Instead,
 128    extended statistics objects, more often called just statistics objects,
 129    can be created to instruct the server to obtain statistics across
 130    interesting sets of columns.
 131
 132    Statistics objects are created using the CREATE STATISTICS command.
 133    Creation of such an object merely creates a catalog entry expressing
 134    interest in the statistics. Actual data collection is performed by
 135    ANALYZE (either a manual command, or background auto-analyze). The
 136    collected values can be examined in the pg_statistic_ext_data catalog.
 137
 138    ANALYZE computes extended statistics based on the same sample of table
 139    rows that it takes for computing regular single-column statistics.
 140    Since the sample size is increased by increasing the statistics target
 141    for the table or any of its columns (as described in the previous
 142    section), a larger statistics target will normally result in more
 143    accurate extended statistics, as well as more time spent calculating
 144    them.
 145
 146    The following subsections describe the kinds of extended statistics
 147    that are currently supported.
 148
 149 14.2.2.1. Functional Dependencies #
 150
 151    The simplest kind of extended statistics tracks functional
 152    dependencies, a concept used in definitions of database normal forms.
 153    We say that column b is functionally dependent on column a if knowledge
 154    of the value of a is sufficient to determine the value of b, that is
 155    there are no two rows having the same value of a but different values
 156    of b. In a fully normalized database, functional dependencies should
 157    exist only on primary keys and superkeys. However, in practice many
 158    data sets are not fully normalized for various reasons; intentional
 159    denormalization for performance reasons is a common example. Even in a
 160    fully normalized database, there may be partial correlation between
 161    some columns, which can be expressed as partial functional dependency.
 162
 163    The existence of functional dependencies directly affects the accuracy
 164    of estimates in certain queries. If a query contains conditions on both
 165    the independent and the dependent column(s), the conditions on the
 166    dependent columns do not further reduce the result size; but without
 167    knowledge of the functional dependency, the query planner will assume
 168    that the conditions are independent, resulting in underestimating the
 169    result size.
 170
 171    To inform the planner about functional dependencies, ANALYZE can
 172    collect measurements of cross-column dependency. Assessing the degree
 173    of dependency between all sets of columns would be prohibitively
 174    expensive, so data collection is limited to those groups of columns
 175    appearing together in a statistics object defined with the dependencies
 176    option. It is advisable to create dependencies statistics only for
 177    column groups that are strongly correlated, to avoid unnecessary
 178    overhead in both ANALYZE and later query planning.
 179
 180    Here is an example of collecting functional-dependency statistics:
 181 CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes;
 182
 183 ANALYZE zipcodes;
 184
 185 SELECT stxname, stxkeys, stxddependencies
 186   FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
 187   WHERE stxname = 'stts';
 188  stxname | stxkeys |             stxddependencies
 189 ---------+---------+------------------------------------------
 190  stts    | 1 5     | {"1 => 5": 1.000000, "5 => 1": 0.423130}
 191 (1 row)
 192
 193    Here it can be seen that column 1 (zip code) fully determines column 5
 194    (city) so the coefficient is 1.0, while city only determines zip code
 195    about 42% of the time, meaning that there are many cities (58%) that
 196    are represented by more than a single ZIP code.
 197
 198    When computing the selectivity for a query involving functionally
 199    dependent columns, the planner adjusts the per-condition selectivity
 200    estimates using the dependency coefficients so as not to produce an
 201    underestimate.
 202
 203 14.2.2.1.1. Limitations of Functional Dependencies #
 204
 205    Functional dependencies are currently only applied when considering
 206    simple equality conditions that compare columns to constant values, and
 207    IN clauses with constant values. They are not used to improve estimates
 208    for equality conditions comparing two columns or comparing a column to
 209    an expression, nor for range clauses, LIKE or any other type of
 210    condition.
 211
 212    When estimating with functional dependencies, the planner assumes that
 213    conditions on the involved columns are compatible and hence redundant.
 214    If they are incompatible, the correct estimate would be zero rows, but
 215    that possibility is not considered. For example, given a query like
 216 SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105';
 217
 218    the planner will disregard the city clause as not changing the
 219    selectivity, which is correct. However, it will make the same
 220    assumption about
 221 SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210';
 222
 223    even though there will really be zero rows satisfying this query.
 224    Functional dependency statistics do not provide enough information to
 225    conclude that, however.
 226
 227    In many practical situations, this assumption is usually satisfied; for
 228    example, there might be a GUI in the application that only allows
 229    selecting compatible city and ZIP code values to use in a query. But if
 230    that's not the case, functional dependencies may not be a viable
 231    option.
 232
 233 14.2.2.2. Multivariate N-Distinct Counts #
 234
 235    Single-column statistics store the number of distinct values in each
 236    column. Estimates of the number of distinct values when combining more
 237    than one column (for example, for GROUP BY a, b) are frequently wrong
 238    when the planner only has single-column statistical data, causing it to
 239    select bad plans.
 240
 241    To improve such estimates, ANALYZE can collect n-distinct statistics
 242    for groups of columns. As before, it's impractical to do this for every
 243    possible column grouping, so data is collected only for those groups of
 244    columns appearing together in a statistics object defined with the
 245    ndistinct option. Data will be collected for each possible combination
 246    of two or more columns from the set of listed columns.
 247
 248    Continuing the previous example, the n-distinct counts in a table of
 249    ZIP codes might look like the following:
 250 CREATE STATISTICS stts2 (ndistinct) ON city, state, zip FROM zipcodes;
 251
 252 ANALYZE zipcodes;
 253
 254 SELECT stxkeys AS k, stxdndistinct AS nd
 255   FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
 256   WHERE stxname = 'stts2';
 257 -[ RECORD 1 ]--------------------------------------------------------
 258 k  | 1 2 5
 259 nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
 260 (1 row)
 261
 262    This indicates that there are three combinations of columns that have
 263    33178 distinct values: ZIP code and state; ZIP code and city; and ZIP
 264    code, city and state (the fact that they are all equal is expected
 265    given that ZIP code alone is unique in this table). On the other hand,
 266    the combination of city and state has only 27435 distinct values.
 267
 268    It's advisable to create ndistinct statistics objects only on
 269    combinations of columns that are actually used for grouping, and for
 270    which misestimation of the number of groups is resulting in bad plans.
 271    Otherwise, the ANALYZE cycles are just wasted.
 272
 273 14.2.2.3. Multivariate MCV Lists #
 274
 275    Another type of statistic stored for each column are most-common value
 276    lists. This allows very accurate estimates for individual columns, but
 277    may result in significant misestimates for queries with conditions on
 278    multiple columns.
 279
 280    To improve such estimates, ANALYZE can collect MCV lists on
 281    combinations of columns. Similarly to functional dependencies and
 282    n-distinct coefficients, it's impractical to do this for every possible
 283    column grouping. Even more so in this case, as the MCV list (unlike
 284    functional dependencies and n-distinct coefficients) does store the
 285    common column values. So data is collected only for those groups of
 286    columns appearing together in a statistics object defined with the mcv
 287    option.
 288
 289    Continuing the previous example, the MCV list for a table of ZIP codes
 290    might look like the following (unlike for simpler types of statistics,
 291    a function is required for inspection of MCV contents):
 292 CREATE STATISTICS stts3 (mcv) ON city, state FROM zipcodes;
 293
 294 ANALYZE zipcodes;
 295
 296 SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
 297                 pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts3';
 298
 299  index |         values         | nulls | frequency | base_frequency
 300 -------+------------------------+-------+-----------+----------------
 301      0 | {Washington, DC}       | {f,f} |  0.003467 |        2.7e-05
 302      1 | {Apo, AE}              | {f,f} |  0.003067 |        1.9e-05
 303      2 | {Houston, TX}          | {f,f} |  0.002167 |       0.000133
 304      3 | {El Paso, TX}          | {f,f} |     0.002 |       0.000113
 305      4 | {New York, NY}         | {f,f} |  0.001967 |       0.000114
 306      5 | {Atlanta, GA}          | {f,f} |  0.001633 |        3.3e-05
 307      6 | {Sacramento, CA}       | {f,f} |  0.001433 |        7.8e-05
 308      7 | {Miami, FL}            | {f,f} |    0.0014 |          6e-05
 309      8 | {Dallas, TX}           | {f,f} |  0.001367 |        8.8e-05
 310      9 | {Chicago, IL}          | {f,f} |  0.001333 |        5.1e-05
 311    ...
 312 (99 rows)
 313
 314    This indicates that the most common combination of city and state is
 315    Washington in DC, with actual frequency (in the sample) about 0.35%.
 316    The base frequency of the combination (as computed from the simple
 317    per-column frequencies) is only 0.0027%, resulting in two orders of
 318    magnitude under-estimates.
 319
 320    It's advisable to create MCV statistics objects only on combinations of
 321    columns that are actually used in conditions together, and for which
 322    misestimation of the number of groups is resulting in bad plans.
 323    Otherwise, the ANALYZE and planning cycles are just wasted.