begriffs open source - ai-pg/blob - full-docs/html/row-estimation-examples.html

   1 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
   2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>69.1. Row Estimation Examples</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="planner-stats-details.html" title="Chapter 69. How the Planner Uses Statistics" /><link rel="next" href="multivariate-statistics-examples.html" title="69.2. Multivariate Statistics Examples" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">69.1. Row Estimation Examples</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="planner-stats-details.html" title="Chapter 69. How the Planner Uses Statistics">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="planner-stats-details.html" title="Chapter 69. How the Planner Uses Statistics">Up</a></td><th width="60%" align="center">Chapter 69. How the Planner Uses Statistics</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="multivariate-statistics-examples.html" title="69.2. Multivariate Statistics Examples">Next</a></td></tr></table><hr /></div><div class="sect1" id="ROW-ESTIMATION-EXAMPLES"><div class="titlepage"><div><div><h2 class="title" style="clear: both">69.1. Row Estimation Examples <a href="#ROW-ESTIMATION-EXAMPLES" class="id_link">#</a></h2></div></div></div><a id="id-1.10.21.4.2" class="indexterm"></a><p>
   3    The examples shown below use tables in the <span class="productname">PostgreSQL</span>
   4    regression test database.
   5    Note also that since <code class="command">ANALYZE</code> uses random sampling
   6    while producing statistics, the results will change slightly after
   7    any new <code class="command">ANALYZE</code>.
   8   </p><p>
   9    Let's start with a very simple query:
  10
  11 </p><pre class="programlisting">
  12 EXPLAIN SELECT * FROM tenk1;
  13
  14                          QUERY PLAN
  15 -------------------------------------------------------------
  16  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244)
  17 </pre><p>
  18
  19    How the planner determines the cardinality of <code class="structname">tenk1</code>
  20    is covered in <a class="xref" href="planner-stats.html" title="14.2. Statistics Used by the Planner">Section 14.2</a>, but is repeated here for
  21    completeness. The number of pages and rows is looked up in
  22    <code class="structname">pg_class</code>:
  23
  24 </p><pre class="programlisting">
  25 SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
  26
  27  relpages | reltuples
  28 ----------+-----------
  29       358 |     10000
  30 </pre><p>
  31
  32     These numbers are current as of the last <code class="command">VACUUM</code> or
  33     <code class="command">ANALYZE</code> on the table.  The planner then fetches the
  34     actual current number of pages in the table (this is a cheap operation,
  35     not requiring a table scan).  If that is different from
  36     <code class="structfield">relpages</code> then
  37     <code class="structfield">reltuples</code> is scaled accordingly to
  38     arrive at a current number-of-rows estimate.  In the example above, the value of
  39     <code class="structfield">relpages</code> is up-to-date so the rows estimate is
  40     the same as <code class="structfield">reltuples</code>.
  41   </p><p>
  42    Let's move on to an example with a range condition in its
  43    <code class="literal">WHERE</code> clause:
  44
  45 </p><pre class="programlisting">
  46 EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 1000;
  47
  48                                    QUERY PLAN
  49 --------------------------------------------------------------------------------
  50  Bitmap Heap Scan on tenk1  (cost=24.06..394.64 rows=1007 width=244)
  51    Recheck Cond: (unique1 &lt; 1000)
  52    -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..23.80 rows=1007 width=0)
  53          Index Cond: (unique1 &lt; 1000)
  54 </pre><p>
  55
  56    The planner examines the <code class="literal">WHERE</code> clause condition
  57    and looks up the selectivity function for the operator
  58    <code class="literal">&lt;</code> in <code class="structname">pg_operator</code>.
  59    This is held in the column <code class="structfield">oprrest</code>,
  60    and the entry in this case is <code class="function">scalarltsel</code>.
  61    The <code class="function">scalarltsel</code> function retrieves the histogram for
  62    <code class="structfield">unique1</code> from
  63    <code class="structname">pg_statistic</code>.  For manual queries it is more
  64    convenient to look in the simpler <code class="structname">pg_stats</code>
  65    view:
  66
  67 </p><pre class="programlisting">
  68 SELECT histogram_bounds FROM pg_stats
  69 WHERE tablename='tenk1' AND attname='unique1';
  70
  71                    histogram_bounds
  72 ------------------------------------------------------
  73  {0,993,1997,3050,4040,5036,5957,7057,8029,9016,9995}
  74 </pre><p>
  75
  76    Next the fraction of the histogram occupied by <span class="quote">“<span class="quote">&lt; 1000</span>”</span>
  77    is worked out. This is the selectivity. The histogram divides the range
  78    into equal frequency buckets, so all we have to do is locate the bucket
  79    that our value is in and count <span class="emphasis"><em>part</em></span> of it and
  80    <span class="emphasis"><em>all</em></span> of the ones before. The value 1000 is clearly in
  81    the second bucket (993–1997).  Assuming a linear distribution of
  82    values inside each bucket, we can calculate the selectivity as:
  83
  84 </p><pre class="programlisting">
  85 selectivity = (1 + (1000 - bucket[2].min)/(bucket[2].max - bucket[2].min))/num_buckets
  86             = (1 + (1000 - 993)/(1997 - 993))/10
  87             = 0.100697
  88 </pre><p>
  89
  90    that is, one whole bucket plus a linear fraction of the second, divided by
  91    the number of buckets. The estimated number of rows can now be calculated as
  92    the product of the selectivity and the cardinality of
  93    <code class="structname">tenk1</code>:
  94
  95 </p><pre class="programlisting">
  96 rows = rel_cardinality * selectivity
  97      = 10000 * 0.100697
  98      = 1007  (rounding off)
  99 </pre><p>
 100   </p><p>
 101    Next let's consider an example with an equality condition in its
 102    <code class="literal">WHERE</code> clause:
 103
 104 </p><pre class="programlisting">
 105 EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'CRAAAA';
 106
 107                         QUERY PLAN
 108 ----------------------------------------------------------
 109  Seq Scan on tenk1  (cost=0.00..483.00 rows=30 width=244)
 110    Filter: (stringu1 = 'CRAAAA'::name)
 111 </pre><p>
 112
 113    Again the planner examines the <code class="literal">WHERE</code> clause condition
 114    and looks up the selectivity function for <code class="literal">=</code>, which is
 115    <code class="function">eqsel</code>.  For equality estimation the histogram is
 116    not useful; instead the list of <em class="firstterm">most
 117    common values</em> (<acronym class="acronym">MCV</acronym>s) is used to determine the
 118    selectivity. Let's have a look at the MCVs, with some additional columns
 119    that will be useful later:
 120
 121 </p><pre class="programlisting">
 122 SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
 123 WHERE tablename='tenk1' AND attname='stringu1';
 124
 125 null_frac         | 0
 126 n_distinct        | 676
 127 most_common_vals  | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,JOAAAA,MCAAAA,NAAAAA,WGAAAA}
 128 most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003}
 129
 130 </pre><p>
 131
 132    Since <code class="literal">CRAAAA</code> appears in the list of MCVs, the selectivity is
 133    merely the corresponding entry in the list of most common frequencies
 134    (<acronym class="acronym">MCF</acronym>s):
 135
 136 </p><pre class="programlisting">
 137 selectivity = mcf[3]
 138             = 0.003
 139 </pre><p>
 140
 141    As before, the estimated number of rows is just the product of this with the
 142    cardinality of <code class="structname">tenk1</code>:
 143
 144 </p><pre class="programlisting">
 145 rows = 10000 * 0.003
 146      = 30
 147 </pre><p>
 148   </p><p>
 149    Now consider the same query, but with a constant that is not in the
 150    <acronym class="acronym">MCV</acronym> list:
 151
 152 </p><pre class="programlisting">
 153 EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx';
 154
 155                         QUERY PLAN
 156 ----------------------------------------------------------
 157  Seq Scan on tenk1  (cost=0.00..483.00 rows=15 width=244)
 158    Filter: (stringu1 = 'xxx'::name)
 159 </pre><p>
 160
 161    This is quite a different problem: how to estimate the selectivity when the
 162    value is <span class="emphasis"><em>not</em></span> in the <acronym class="acronym">MCV</acronym> list.
 163    The approach is to use the fact that the value is not in the list,
 164    combined with the knowledge of the frequencies for all of the
 165    <acronym class="acronym">MCV</acronym>s:
 166
 167 </p><pre class="programlisting">
 168 selectivity = (1 - sum(mcv_freqs))/(num_distinct - num_mcv)
 169             = (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 +
 170                     0.003 + 0.003 + 0.003 + 0.003))/(676 - 10)
 171             = 0.0014559
 172 </pre><p>
 173
 174    That is, add up all the frequencies for the <acronym class="acronym">MCV</acronym>s and
 175    subtract them from one, then
 176    divide by the number of <span class="emphasis"><em>other</em></span> distinct values.
 177    This amounts to assuming that the fraction of the column that is not any
 178    of the MCVs is evenly distributed among all the other distinct values.
 179    Notice that there are no null values so we don't have to worry about those
 180    (otherwise we'd subtract the null fraction from the numerator as well).
 181    The estimated number of rows is then calculated as usual:
 182
 183 </p><pre class="programlisting">
 184 rows = 10000 * 0.0014559
 185      = 15  (rounding off)
 186 </pre><p>
 187   </p><p>
 188    The previous example with <code class="literal">unique1 &lt; 1000</code> was an
 189    oversimplification of what <code class="function">scalarltsel</code> really does;
 190    now that we have seen an example of the use of MCVs, we can fill in some
 191    more detail.  The example was correct as far as it went, because since
 192    <code class="structfield">unique1</code> is a unique column it has no MCVs (obviously, no
 193    value is any more common than any other value).  For a non-unique
 194    column, there will normally be both a histogram and an MCV list, and
 195    <span class="emphasis"><em>the histogram does not include the portion of the column
 196    population represented by the MCVs</em></span>.  We do things this way because
 197    it allows more precise estimation.  In this situation
 198    <code class="function">scalarltsel</code> directly applies the condition (e.g.,
 199    <span class="quote">“<span class="quote">&lt; 1000</span>”</span>) to each value of the MCV list, and adds up the
 200    frequencies of the MCVs for which the condition is true.  This gives
 201    an exact estimate of the selectivity within the portion of the table
 202    that is MCVs.  The histogram is then used in the same way as above
 203    to estimate the selectivity in the portion of the table that is not
 204    MCVs, and then the two numbers are combined to estimate the overall
 205    selectivity.  For example, consider
 206
 207 </p><pre class="programlisting">
 208 EXPLAIN SELECT * FROM tenk1 WHERE stringu1 &lt; 'IAAAAA';
 209
 210                          QUERY PLAN
 211 ------------------------------------------------------------
 212  Seq Scan on tenk1  (cost=0.00..483.00 rows=3077 width=244)
 213    Filter: (stringu1 &lt; 'IAAAAA'::name)
 214 </pre><p>
 215
 216    We already saw the MCV information for <code class="structfield">stringu1</code>,
 217    and here is its histogram:
 218
 219 </p><pre class="programlisting">
 220 SELECT histogram_bounds FROM pg_stats
 221 WHERE tablename='tenk1' AND attname='stringu1';
 222
 223                                 histogram_bounds
 224 --------------------------------------------------------------------------------
 225  {AAAAAA,CQAAAA,FRAAAA,IBAAAA,KRAAAA,NFAAAA,PSAAAA,SGAAAA,VAAAAA,XLAAAA,ZZAAAA}
 226 </pre><p>
 227
 228    Checking the MCV list, we find that the condition <code class="literal">stringu1 &lt;
 229    'IAAAAA'</code> is satisfied by the first six entries and not the last four,
 230    so the selectivity within the MCV part of the population is
 231
 232 </p><pre class="programlisting">
 233 selectivity = sum(relevant mvfs)
 234             = 0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003
 235             = 0.01833333
 236 </pre><p>
 237
 238    Summing all the MCFs also tells us that the total fraction of the
 239    population represented by MCVs is 0.03033333, and therefore the
 240    fraction represented by the histogram is 0.96966667 (again, there
 241    are no nulls, else we'd have to exclude them here).  We can see
 242    that the value <code class="literal">IAAAAA</code> falls nearly at the end of the
 243    third histogram bucket.  Using some rather cheesy assumptions
 244    about the frequency of different characters, the planner arrives
 245    at the estimate 0.298387 for the portion of the histogram population
 246    that is less than <code class="literal">IAAAAA</code>.  We then combine the estimates
 247    for the MCV and non-MCV populations:
 248
 249 </p><pre class="programlisting">
 250 selectivity = mcv_selectivity + histogram_selectivity * histogram_fraction
 251             = 0.01833333 + 0.298387 * 0.96966667
 252             = 0.307669
 253
 254 rows        = 10000 * 0.307669
 255             = 3077  (rounding off)
 256 </pre><p>
 257
 258    In this particular example, the correction from the MCV list is fairly
 259    small, because the column distribution is actually quite flat (the
 260    statistics showing these particular values as being more common than
 261    others are mostly due to sampling error).  In a more typical case where
 262    some values are significantly more common than others, this complicated
 263    process gives a useful improvement in accuracy because the selectivity
 264    for the most common values is found exactly.
 265   </p><p>
 266    Now let's consider a case with more than one
 267    condition in the <code class="literal">WHERE</code> clause:
 268
 269 </p><pre class="programlisting">
 270 EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 1000 AND stringu1 = 'xxx';
 271
 272                                    QUERY PLAN
 273 --------------------------------------------------------------------------------
 274  Bitmap Heap Scan on tenk1  (cost=23.80..396.91 rows=1 width=244)
 275    Recheck Cond: (unique1 &lt; 1000)
 276    Filter: (stringu1 = 'xxx'::name)
 277    -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..23.80 rows=1007 width=0)
 278          Index Cond: (unique1 &lt; 1000)
 279 </pre><p>
 280
 281    The planner assumes that the two conditions are independent, so that
 282    the individual selectivities of the clauses can be multiplied together:
 283
 284 </p><pre class="programlisting">
 285 selectivity = selectivity(unique1 &lt; 1000) * selectivity(stringu1 = 'xxx')
 286             = 0.100697 * 0.0014559
 287             = 0.0001466
 288
 289 rows        = 10000 * 0.0001466
 290             = 1  (rounding off)
 291 </pre><p>
 292
 293    Notice that the number of rows estimated to be returned from the bitmap
 294    index scan reflects only the condition used with the index; this is
 295    important since it affects the cost estimate for the subsequent heap
 296    fetches.
 297   </p><p>
 298    Finally we will examine a query that involves a join:
 299
 300 </p><pre class="programlisting">
 301 EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
 302 WHERE t1.unique1 &lt; 50 AND t1.unique2 = t2.unique2;
 303
 304                                       QUERY PLAN
 305 --------------------------------------------------------------------------------------
 306  Nested Loop  (cost=4.64..456.23 rows=50 width=488)
 307    -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=4.64..142.17 rows=50 width=244)
 308          Recheck Cond: (unique1 &lt; 50)
 309          -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..4.63 rows=50 width=0)
 310                Index Cond: (unique1 &lt; 50)
 311    -&gt;  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.00..6.27 rows=1 width=244)
 312          Index Cond: (unique2 = t1.unique2)
 313 </pre><p>
 314
 315    The restriction on <code class="structname">tenk1</code>,
 316    <code class="literal">unique1 &lt; 50</code>,
 317    is evaluated before the nested-loop join.
 318    This is handled analogously to the previous range example.  This time the
 319    value 50 falls into the first bucket of the
 320    <code class="structfield">unique1</code> histogram:
 321
 322 </p><pre class="programlisting">
 323 selectivity = (0 + (50 - bucket[1].min)/(bucket[1].max - bucket[1].min))/num_buckets
 324             = (0 + (50 - 0)/(993 - 0))/10
 325             = 0.005035
 326
 327 rows        = 10000 * 0.005035
 328             = 50  (rounding off)
 329 </pre><p>
 330
 331    The restriction for the join is <code class="literal">t2.unique2 = t1.unique2</code>.
 332    The operator is just
 333    our familiar <code class="literal">=</code>, however the selectivity function is
 334    obtained from the <code class="structfield">oprjoin</code> column of
 335    <code class="structname">pg_operator</code>, and is <code class="function">eqjoinsel</code>.
 336    <code class="function">eqjoinsel</code> looks up the statistical information for both
 337    <code class="structname">tenk2</code> and <code class="structname">tenk1</code>:
 338
 339 </p><pre class="programlisting">
 340 SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
 341 WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';
 342
 343 tablename  | null_frac | n_distinct | most_common_vals
 344 -----------+-----------+------------+------------------
 345  tenk1     |         0 |         -1 |
 346  tenk2     |         0 |         -1 |
 347 </pre><p>
 348
 349    In this case there is no <acronym class="acronym">MCV</acronym> information for
 350    <code class="structname">unique2</code> and all the values appear to be
 351    unique (n_distinct = -1), so we use an algorithm that relies on the row
 352    count estimates for both relations (num_rows, not shown, but "tenk")
 353    together with the column null fractions (zero for both):
 354
 355 </p><pre class="programlisting">
 356 selectivity = (1 - null_frac1) * (1 - null_frac2) / max(num_rows1, num_rows2)
 357             = (1 - 0) * (1 - 0) / max(10000, 10000)
 358             = 0.0001
 359 </pre><p>
 360
 361    This is, subtract the null fraction from one for each of the relations,
 362    and divide by the row count of the larger relation (this value does get
 363    scaled in the non-unique case).
 364    The number of rows
 365    that the join is likely to emit is calculated as the cardinality of the
 366    Cartesian product of the two inputs, multiplied by the
 367    selectivity:
 368
 369 </p><pre class="programlisting">
 370 rows = (outer_cardinality * inner_cardinality) * selectivity
 371      = (50 * 10000) * 0.0001
 372      = 50
 373 </pre><p>
 374   </p><p>
 375    Had there been MCV lists for the two columns,
 376    <code class="function">eqjoinsel</code> would have used direct comparison of the MCV
 377    lists to determine the join selectivity within the part of the column
 378    populations represented by the MCVs.  The estimate for the remainder of the
 379    populations follows the same approach shown here.
 380   </p><p>
 381    Notice that we showed <code class="literal">inner_cardinality</code> as 10000, that is,
 382    the unmodified size of <code class="structname">tenk2</code>.  It might appear from
 383    inspection of the <code class="command">EXPLAIN</code> output that the estimate of
 384    join rows comes from 50 * 1, that is, the number of outer rows times
 385    the estimated number of rows obtained by each inner index scan on
 386    <code class="structname">tenk2</code>.  But this is not the case: the join relation size
 387    is estimated before any particular join plan has been considered.  If
 388    everything is working well then the two ways of estimating the join
 389    size will produce about the same answer, but due to round-off error and
 390    other factors they sometimes diverge significantly.
 391   </p><p>
 392    For those interested in further details, estimation of the size of
 393    a table (before any <code class="literal">WHERE</code> clauses) is done in
 394    <code class="filename">src/backend/optimizer/util/plancat.c</code>. The generic
 395    logic for clause selectivities is in
 396    <code class="filename">src/backend/optimizer/path/clausesel.c</code>.  The
 397    operator-specific selectivity functions are mostly found
 398    in <code class="filename">src/backend/utils/adt/selfuncs.c</code>.
 399   </p></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="planner-stats-details.html" title="Chapter 69. How the Planner Uses Statistics">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="planner-stats-details.html" title="Chapter 69. How the Planner Uses Statistics">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="multivariate-statistics-examples.html" title="69.2. Multivariate Statistics Examples">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 69. How the Planner Uses Statistics </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 69.2. Multivariate Statistics Examples</td></tr></table></div></body></html>