begriffs open source - ai-pg/blob - full-docs/txt/row-estimation-examples.txt

   1
   2 69.1. Row Estimation Examples #
   3
   4    The examples shown below use tables in the PostgreSQL regression test
   5    database. Note also that since ANALYZE uses random sampling while
   6    producing statistics, the results will change slightly after any new
   7    ANALYZE.
   8
   9    Let's start with a very simple query:
  10 EXPLAIN SELECT * FROM tenk1;
  11
  12                          QUERY PLAN
  13 -------------------------------------------------------------
  14  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244)
  15
  16    How the planner determines the cardinality of tenk1 is covered in
  17    Section 14.2, but is repeated here for completeness. The number of
  18    pages and rows is looked up in pg_class:
  19 SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
  20
  21  relpages | reltuples
  22 ----------+-----------
  23       358 |     10000
  24
  25    These numbers are current as of the last VACUUM or ANALYZE on the
  26    table. The planner then fetches the actual current number of pages in
  27    the table (this is a cheap operation, not requiring a table scan). If
  28    that is different from relpages then reltuples is scaled accordingly to
  29    arrive at a current number-of-rows estimate. In the example above, the
  30    value of relpages is up-to-date so the rows estimate is the same as
  31    reltuples.
  32
  33    Let's move on to an example with a range condition in its WHERE clause:
  34 EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000;
  35
  36                                    QUERY PLAN
  37 -------------------------------------------------------------------------------
  38 -
  39  Bitmap Heap Scan on tenk1  (cost=24.06..394.64 rows=1007 width=244)
  40    Recheck Cond: (unique1 < 1000)
  41    ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..23.80 rows=1007 width=0)
  42          Index Cond: (unique1 < 1000)
  43
  44    The planner examines the WHERE clause condition and looks up the
  45    selectivity function for the operator < in pg_operator. This is held in
  46    the column oprrest, and the entry in this case is scalarltsel. The
  47    scalarltsel function retrieves the histogram for unique1 from
  48    pg_statistic. For manual queries it is more convenient to look in the
  49    simpler pg_stats view:
  50 SELECT histogram_bounds FROM pg_stats
  51 WHERE tablename='tenk1' AND attname='unique1';
  52
  53                    histogram_bounds
  54 ------------------------------------------------------
  55  {0,993,1997,3050,4040,5036,5957,7057,8029,9016,9995}
  56
  57    Next the fraction of the histogram occupied by “< 1000” is worked out.
  58    This is the selectivity. The histogram divides the range into equal
  59    frequency buckets, so all we have to do is locate the bucket that our
  60    value is in and count part of it and all of the ones before. The value
  61    1000 is clearly in the second bucket (993–1997). Assuming a linear
  62    distribution of values inside each bucket, we can calculate the
  63    selectivity as:
  64 selectivity = (1 + (1000 - bucket[2].min)/(bucket[2].max - bucket[2].min))/num_b
  65 uckets
  66             = (1 + (1000 - 993)/(1997 - 993))/10
  67             = 0.100697
  68
  69    that is, one whole bucket plus a linear fraction of the second, divided
  70    by the number of buckets. The estimated number of rows can now be
  71    calculated as the product of the selectivity and the cardinality of
  72    tenk1:
  73 rows = rel_cardinality * selectivity
  74      = 10000 * 0.100697
  75      = 1007  (rounding off)
  76
  77    Next let's consider an example with an equality condition in its WHERE
  78    clause:
  79 EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'CRAAAA';
  80
  81                         QUERY PLAN
  82 ----------------------------------------------------------
  83  Seq Scan on tenk1  (cost=0.00..483.00 rows=30 width=244)
  84    Filter: (stringu1 = 'CRAAAA'::name)
  85
  86    Again the planner examines the WHERE clause condition and looks up the
  87    selectivity function for =, which is eqsel. For equality estimation the
  88    histogram is not useful; instead the list of most common values (MCVs)
  89    is used to determine the selectivity. Let's have a look at the MCVs,
  90    with some additional columns that will be useful later:
  91 SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
  92 WHERE tablename='tenk1' AND attname='stringu1';
  93
  94 null_frac         | 0
  95 n_distinct        | 676
  96 most_common_vals  | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,JOAAAA,MCAAAA,NA
  97 AAAA,WGAAAA}
  98 most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003
  99 ,0.003}
 100
 101
 102    Since CRAAAA appears in the list of MCVs, the selectivity is merely the
 103    corresponding entry in the list of most common frequencies (MCFs):
 104 selectivity = mcf[3]
 105             = 0.003
 106
 107    As before, the estimated number of rows is just the product of this
 108    with the cardinality of tenk1:
 109 rows = 10000 * 0.003
 110      = 30
 111
 112    Now consider the same query, but with a constant that is not in the MCV
 113    list:
 114 EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx';
 115
 116                         QUERY PLAN
 117 ----------------------------------------------------------
 118  Seq Scan on tenk1  (cost=0.00..483.00 rows=15 width=244)
 119    Filter: (stringu1 = 'xxx'::name)
 120
 121    This is quite a different problem: how to estimate the selectivity when
 122    the value is not in the MCV list. The approach is to use the fact that
 123    the value is not in the list, combined with the knowledge of the
 124    frequencies for all of the MCVs:
 125 selectivity = (1 - sum(mcv_freqs))/(num_distinct - num_mcv)
 126             = (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 +
 127                     0.003 + 0.003 + 0.003 + 0.003))/(676 - 10)
 128             = 0.0014559
 129
 130    That is, add up all the frequencies for the MCVs and subtract them from
 131    one, then divide by the number of other distinct values. This amounts
 132    to assuming that the fraction of the column that is not any of the MCVs
 133    is evenly distributed among all the other distinct values. Notice that
 134    there are no null values so we don't have to worry about those
 135    (otherwise we'd subtract the null fraction from the numerator as well).
 136    The estimated number of rows is then calculated as usual:
 137 rows = 10000 * 0.0014559
 138      = 15  (rounding off)
 139
 140    The previous example with unique1 < 1000 was an oversimplification of
 141    what scalarltsel really does; now that we have seen an example of the
 142    use of MCVs, we can fill in some more detail. The example was correct
 143    as far as it went, because since unique1 is a unique column it has no
 144    MCVs (obviously, no value is any more common than any other value). For
 145    a non-unique column, there will normally be both a histogram and an MCV
 146    list, and the histogram does not include the portion of the column
 147    population represented by the MCVs. We do things this way because it
 148    allows more precise estimation. In this situation scalarltsel directly
 149    applies the condition (e.g., “< 1000”) to each value of the MCV list,
 150    and adds up the frequencies of the MCVs for which the condition is
 151    true. This gives an exact estimate of the selectivity within the
 152    portion of the table that is MCVs. The histogram is then used in the
 153    same way as above to estimate the selectivity in the portion of the
 154    table that is not MCVs, and then the two numbers are combined to
 155    estimate the overall selectivity. For example, consider
 156 EXPLAIN SELECT * FROM tenk1 WHERE stringu1 < 'IAAAAA';
 157
 158                          QUERY PLAN
 159 ------------------------------------------------------------
 160  Seq Scan on tenk1  (cost=0.00..483.00 rows=3077 width=244)
 161    Filter: (stringu1 < 'IAAAAA'::name)
 162
 163    We already saw the MCV information for stringu1, and here is its
 164    histogram:
 165 SELECT histogram_bounds FROM pg_stats
 166 WHERE tablename='tenk1' AND attname='stringu1';
 167
 168                                 histogram_bounds
 169 -------------------------------------------------------------------------------
 170 -
 171  {AAAAAA,CQAAAA,FRAAAA,IBAAAA,KRAAAA,NFAAAA,PSAAAA,SGAAAA,VAAAAA,XLAAAA,ZZAAAA}
 172
 173    Checking the MCV list, we find that the condition stringu1 < 'IAAAAA'
 174    is satisfied by the first six entries and not the last four, so the
 175    selectivity within the MCV part of the population is
 176 selectivity = sum(relevant mvfs)
 177             = 0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003
 178             = 0.01833333
 179
 180    Summing all the MCFs also tells us that the total fraction of the
 181    population represented by MCVs is 0.03033333, and therefore the
 182    fraction represented by the histogram is 0.96966667 (again, there are
 183    no nulls, else we'd have to exclude them here). We can see that the
 184    value IAAAAA falls nearly at the end of the third histogram bucket.
 185    Using some rather cheesy assumptions about the frequency of different
 186    characters, the planner arrives at the estimate 0.298387 for the
 187    portion of the histogram population that is less than IAAAAA. We then
 188    combine the estimates for the MCV and non-MCV populations:
 189 selectivity = mcv_selectivity + histogram_selectivity * histogram_fraction
 190             = 0.01833333 + 0.298387 * 0.96966667
 191             = 0.307669
 192
 193 rows        = 10000 * 0.307669
 194             = 3077  (rounding off)
 195
 196    In this particular example, the correction from the MCV list is fairly
 197    small, because the column distribution is actually quite flat (the
 198    statistics showing these particular values as being more common than
 199    others are mostly due to sampling error). In a more typical case where
 200    some values are significantly more common than others, this complicated
 201    process gives a useful improvement in accuracy because the selectivity
 202    for the most common values is found exactly.
 203
 204    Now let's consider a case with more than one condition in the WHERE
 205    clause:
 206 EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000 AND stringu1 = 'xxx';
 207
 208                                    QUERY PLAN
 209 -------------------------------------------------------------------------------
 210 -
 211  Bitmap Heap Scan on tenk1  (cost=23.80..396.91 rows=1 width=244)
 212    Recheck Cond: (unique1 < 1000)
 213    Filter: (stringu1 = 'xxx'::name)
 214    ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..23.80 rows=1007 width=0)
 215          Index Cond: (unique1 < 1000)
 216
 217    The planner assumes that the two conditions are independent, so that
 218    the individual selectivities of the clauses can be multiplied together:
 219 selectivity = selectivity(unique1 < 1000) * selectivity(stringu1 = 'xxx')
 220             = 0.100697 * 0.0014559
 221             = 0.0001466
 222
 223 rows        = 10000 * 0.0001466
 224             = 1  (rounding off)
 225
 226    Notice that the number of rows estimated to be returned from the bitmap
 227    index scan reflects only the condition used with the index; this is
 228    important since it affects the cost estimate for the subsequent heap
 229    fetches.
 230
 231    Finally we will examine a query that involves a join:
 232 EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
 233 WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2;
 234
 235                                       QUERY PLAN
 236 -------------------------------------------------------------------------------
 237 -------
 238  Nested Loop  (cost=4.64..456.23 rows=50 width=488)
 239    ->  Bitmap Heap Scan on tenk1 t1  (cost=4.64..142.17 rows=50 width=244)
 240          Recheck Cond: (unique1 < 50)
 241          ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..4.63 rows=50 width=
 242 0)
 243                Index Cond: (unique1 < 50)
 244    ->  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.00..6.27 rows=1 width
 245 =244)
 246          Index Cond: (unique2 = t1.unique2)
 247
 248    The restriction on tenk1, unique1 < 50, is evaluated before the
 249    nested-loop join. This is handled analogously to the previous range
 250    example. This time the value 50 falls into the first bucket of the
 251    unique1 histogram:
 252 selectivity = (0 + (50 - bucket[1].min)/(bucket[1].max - bucket[1].min))/num_buc
 253 kets
 254             = (0 + (50 - 0)/(993 - 0))/10
 255             = 0.005035
 256
 257 rows        = 10000 * 0.005035
 258             = 50  (rounding off)
 259
 260    The restriction for the join is t2.unique2 = t1.unique2. The operator
 261    is just our familiar =, however the selectivity function is obtained
 262    from the oprjoin column of pg_operator, and is eqjoinsel. eqjoinsel
 263    looks up the statistical information for both tenk2 and tenk1:
 264 SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
 265 WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';
 266
 267 tablename  | null_frac | n_distinct | most_common_vals
 268 -----------+-----------+------------+------------------
 269  tenk1     |         0 |         -1 |
 270  tenk2     |         0 |         -1 |
 271
 272    In this case there is no MCV information for unique2 and all the values
 273    appear to be unique (n_distinct = -1), so we use an algorithm that
 274    relies on the row count estimates for both relations (num_rows, not
 275    shown, but "tenk") together with the column null fractions (zero for
 276    both):
 277 selectivity = (1 - null_frac1) * (1 - null_frac2) / max(num_rows1, num_rows2)
 278             = (1 - 0) * (1 - 0) / max(10000, 10000)
 279             = 0.0001
 280
 281    This is, subtract the null fraction from one for each of the relations,
 282    and divide by the row count of the larger relation (this value does get
 283    scaled in the non-unique case). The number of rows that the join is
 284    likely to emit is calculated as the cardinality of the Cartesian
 285    product of the two inputs, multiplied by the selectivity:
 286 rows = (outer_cardinality * inner_cardinality) * selectivity
 287      = (50 * 10000) * 0.0001
 288      = 50
 289
 290    Had there been MCV lists for the two columns, eqjoinsel would have used
 291    direct comparison of the MCV lists to determine the join selectivity
 292    within the part of the column populations represented by the MCVs. The
 293    estimate for the remainder of the populations follows the same approach
 294    shown here.
 295
 296    Notice that we showed inner_cardinality as 10000, that is, the
 297    unmodified size of tenk2. It might appear from inspection of the
 298    EXPLAIN output that the estimate of join rows comes from 50 * 1, that
 299    is, the number of outer rows times the estimated number of rows
 300    obtained by each inner index scan on tenk2. But this is not the case:
 301    the join relation size is estimated before any particular join plan has
 302    been considered. If everything is working well then the two ways of
 303    estimating the join size will produce about the same answer, but due to
 304    round-off error and other factors they sometimes diverge significantly.
 305
 306    For those interested in further details, estimation of the size of a
 307    table (before any WHERE clauses) is done in
 308    src/backend/optimizer/util/plancat.c. The generic logic for clause
 309    selectivities is in src/backend/optimizer/path/clausesel.c. The
 310    operator-specific selectivity functions are mostly found in
 311    src/backend/utils/adt/selfuncs.c.