1 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>69.1. Row Estimation Examples</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="planner-stats-details.html" title="Chapter 69. How the Planner Uses Statistics" /><link rel="next" href="multivariate-statistics-examples.html" title="69.2. Multivariate Statistics Examples" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">69.1. Row Estimation Examples</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="planner-stats-details.html" title="Chapter 69. How the Planner Uses Statistics">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="planner-stats-details.html" title="Chapter 69. How the Planner Uses Statistics">Up</a></td><th width="60%" align="center">Chapter 69. How the Planner Uses Statistics</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="multivariate-statistics-examples.html" title="69.2. Multivariate Statistics Examples">Next</a></td></tr></table><hr /></div><div class="sect1" id="ROW-ESTIMATION-EXAMPLES"><div class="titlepage"><div><div><h2 class="title" style="clear: both">69.1. Row Estimation Examples <a href="#ROW-ESTIMATION-EXAMPLES" class="id_link">#</a></h2></div></div></div><a id="id-1.10.21.4.2" class="indexterm"></a><p>
3 The examples shown below use tables in the <span class="productname">PostgreSQL</span>
4 regression test database.
5 Note also that since <code class="command">ANALYZE</code> uses random sampling
6 while producing statistics, the results will change slightly after
7 any new <code class="command">ANALYZE</code>.
9 Let's start with a very simple query:
11 </p><pre class="programlisting">
12 EXPLAIN SELECT * FROM tenk1;
15 -------------------------------------------------------------
16 Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)
19 How the planner determines the cardinality of <code class="structname">tenk1</code>
20 is covered in <a class="xref" href="planner-stats.html" title="14.2. Statistics Used by the Planner">Section 14.2</a>, but is repeated here for
21 completeness. The number of pages and rows is looked up in
22 <code class="structname">pg_class</code>:
24 </p><pre class="programlisting">
25 SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
28 ----------+-----------
32 These numbers are current as of the last <code class="command">VACUUM</code> or
33 <code class="command">ANALYZE</code> on the table. The planner then fetches the
34 actual current number of pages in the table (this is a cheap operation,
35 not requiring a table scan). If that is different from
36 <code class="structfield">relpages</code> then
37 <code class="structfield">reltuples</code> is scaled accordingly to
38 arrive at a current number-of-rows estimate. In the example above, the value of
39 <code class="structfield">relpages</code> is up-to-date so the rows estimate is
40 the same as <code class="structfield">reltuples</code>.
42 Let's move on to an example with a range condition in its
43 <code class="literal">WHERE</code> clause:
45 </p><pre class="programlisting">
46 EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000;
49 --------------------------------------------------------------------------------
50 Bitmap Heap Scan on tenk1 (cost=24.06..394.64 rows=1007 width=244)
51 Recheck Cond: (unique1 < 1000)
52 -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0)
53 Index Cond: (unique1 < 1000)
56 The planner examines the <code class="literal">WHERE</code> clause condition
57 and looks up the selectivity function for the operator
58 <code class="literal"><</code> in <code class="structname">pg_operator</code>.
59 This is held in the column <code class="structfield">oprrest</code>,
60 and the entry in this case is <code class="function">scalarltsel</code>.
61 The <code class="function">scalarltsel</code> function retrieves the histogram for
62 <code class="structfield">unique1</code> from
63 <code class="structname">pg_statistic</code>. For manual queries it is more
64 convenient to look in the simpler <code class="structname">pg_stats</code>
67 </p><pre class="programlisting">
68 SELECT histogram_bounds FROM pg_stats
69 WHERE tablename='tenk1' AND attname='unique1';
72 ------------------------------------------------------
73 {0,993,1997,3050,4040,5036,5957,7057,8029,9016,9995}
76 Next the fraction of the histogram occupied by <span class="quote">“<span class="quote">< 1000</span>”</span>
77 is worked out. This is the selectivity. The histogram divides the range
78 into equal frequency buckets, so all we have to do is locate the bucket
79 that our value is in and count <span class="emphasis"><em>part</em></span> of it and
80 <span class="emphasis"><em>all</em></span> of the ones before. The value 1000 is clearly in
81 the second bucket (993–1997). Assuming a linear distribution of
82 values inside each bucket, we can calculate the selectivity as:
84 </p><pre class="programlisting">
85 selectivity = (1 + (1000 - bucket[2].min)/(bucket[2].max - bucket[2].min))/num_buckets
86 = (1 + (1000 - 993)/(1997 - 993))/10
90 that is, one whole bucket plus a linear fraction of the second, divided by
91 the number of buckets. The estimated number of rows can now be calculated as
92 the product of the selectivity and the cardinality of
93 <code class="structname">tenk1</code>:
95 </p><pre class="programlisting">
96 rows = rel_cardinality * selectivity
101 Next let's consider an example with an equality condition in its
102 <code class="literal">WHERE</code> clause:
104 </p><pre class="programlisting">
105 EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'CRAAAA';
108 ----------------------------------------------------------
109 Seq Scan on tenk1 (cost=0.00..483.00 rows=30 width=244)
110 Filter: (stringu1 = 'CRAAAA'::name)
113 Again the planner examines the <code class="literal">WHERE</code> clause condition
114 and looks up the selectivity function for <code class="literal">=</code>, which is
115 <code class="function">eqsel</code>. For equality estimation the histogram is
116 not useful; instead the list of <em class="firstterm">most
117 common values</em> (<acronym class="acronym">MCV</acronym>s) is used to determine the
118 selectivity. Let's have a look at the MCVs, with some additional columns
119 that will be useful later:
121 </p><pre class="programlisting">
122 SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
123 WHERE tablename='tenk1' AND attname='stringu1';
127 most_common_vals | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,JOAAAA,MCAAAA,NAAAAA,WGAAAA}
128 most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003}
132 Since <code class="literal">CRAAAA</code> appears in the list of MCVs, the selectivity is
133 merely the corresponding entry in the list of most common frequencies
134 (<acronym class="acronym">MCF</acronym>s):
136 </p><pre class="programlisting">
141 As before, the estimated number of rows is just the product of this with the
142 cardinality of <code class="structname">tenk1</code>:
144 </p><pre class="programlisting">
149 Now consider the same query, but with a constant that is not in the
150 <acronym class="acronym">MCV</acronym> list:
152 </p><pre class="programlisting">
153 EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx';
156 ----------------------------------------------------------
157 Seq Scan on tenk1 (cost=0.00..483.00 rows=15 width=244)
158 Filter: (stringu1 = 'xxx'::name)
161 This is quite a different problem: how to estimate the selectivity when the
162 value is <span class="emphasis"><em>not</em></span> in the <acronym class="acronym">MCV</acronym> list.
163 The approach is to use the fact that the value is not in the list,
164 combined with the knowledge of the frequencies for all of the
165 <acronym class="acronym">MCV</acronym>s:
167 </p><pre class="programlisting">
168 selectivity = (1 - sum(mcv_freqs))/(num_distinct - num_mcv)
169 = (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 +
170 0.003 + 0.003 + 0.003 + 0.003))/(676 - 10)
174 That is, add up all the frequencies for the <acronym class="acronym">MCV</acronym>s and
175 subtract them from one, then
176 divide by the number of <span class="emphasis"><em>other</em></span> distinct values.
177 This amounts to assuming that the fraction of the column that is not any
178 of the MCVs is evenly distributed among all the other distinct values.
179 Notice that there are no null values so we don't have to worry about those
180 (otherwise we'd subtract the null fraction from the numerator as well).
181 The estimated number of rows is then calculated as usual:
183 </p><pre class="programlisting">
184 rows = 10000 * 0.0014559
188 The previous example with <code class="literal">unique1 < 1000</code> was an
189 oversimplification of what <code class="function">scalarltsel</code> really does;
190 now that we have seen an example of the use of MCVs, we can fill in some
191 more detail. The example was correct as far as it went, because since
192 <code class="structfield">unique1</code> is a unique column it has no MCVs (obviously, no
193 value is any more common than any other value). For a non-unique
194 column, there will normally be both a histogram and an MCV list, and
195 <span class="emphasis"><em>the histogram does not include the portion of the column
196 population represented by the MCVs</em></span>. We do things this way because
197 it allows more precise estimation. In this situation
198 <code class="function">scalarltsel</code> directly applies the condition (e.g.,
199 <span class="quote">“<span class="quote">< 1000</span>”</span>) to each value of the MCV list, and adds up the
200 frequencies of the MCVs for which the condition is true. This gives
201 an exact estimate of the selectivity within the portion of the table
202 that is MCVs. The histogram is then used in the same way as above
203 to estimate the selectivity in the portion of the table that is not
204 MCVs, and then the two numbers are combined to estimate the overall
205 selectivity. For example, consider
207 </p><pre class="programlisting">
208 EXPLAIN SELECT * FROM tenk1 WHERE stringu1 < 'IAAAAA';
211 ------------------------------------------------------------
212 Seq Scan on tenk1 (cost=0.00..483.00 rows=3077 width=244)
213 Filter: (stringu1 < 'IAAAAA'::name)
216 We already saw the MCV information for <code class="structfield">stringu1</code>,
217 and here is its histogram:
219 </p><pre class="programlisting">
220 SELECT histogram_bounds FROM pg_stats
221 WHERE tablename='tenk1' AND attname='stringu1';
224 --------------------------------------------------------------------------------
225 {AAAAAA,CQAAAA,FRAAAA,IBAAAA,KRAAAA,NFAAAA,PSAAAA,SGAAAA,VAAAAA,XLAAAA,ZZAAAA}
228 Checking the MCV list, we find that the condition <code class="literal">stringu1 <
229 'IAAAAA'</code> is satisfied by the first six entries and not the last four,
230 so the selectivity within the MCV part of the population is
232 </p><pre class="programlisting">
233 selectivity = sum(relevant mvfs)
234 = 0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003
238 Summing all the MCFs also tells us that the total fraction of the
239 population represented by MCVs is 0.03033333, and therefore the
240 fraction represented by the histogram is 0.96966667 (again, there
241 are no nulls, else we'd have to exclude them here). We can see
242 that the value <code class="literal">IAAAAA</code> falls nearly at the end of the
243 third histogram bucket. Using some rather cheesy assumptions
244 about the frequency of different characters, the planner arrives
245 at the estimate 0.298387 for the portion of the histogram population
246 that is less than <code class="literal">IAAAAA</code>. We then combine the estimates
247 for the MCV and non-MCV populations:
249 </p><pre class="programlisting">
250 selectivity = mcv_selectivity + histogram_selectivity * histogram_fraction
251 = 0.01833333 + 0.298387 * 0.96966667
254 rows = 10000 * 0.307669
255 = 3077 (rounding off)
258 In this particular example, the correction from the MCV list is fairly
259 small, because the column distribution is actually quite flat (the
260 statistics showing these particular values as being more common than
261 others are mostly due to sampling error). In a more typical case where
262 some values are significantly more common than others, this complicated
263 process gives a useful improvement in accuracy because the selectivity
264 for the most common values is found exactly.
266 Now let's consider a case with more than one
267 condition in the <code class="literal">WHERE</code> clause:
269 </p><pre class="programlisting">
270 EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000 AND stringu1 = 'xxx';
273 --------------------------------------------------------------------------------
274 Bitmap Heap Scan on tenk1 (cost=23.80..396.91 rows=1 width=244)
275 Recheck Cond: (unique1 < 1000)
276 Filter: (stringu1 = 'xxx'::name)
277 -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0)
278 Index Cond: (unique1 < 1000)
281 The planner assumes that the two conditions are independent, so that
282 the individual selectivities of the clauses can be multiplied together:
284 </p><pre class="programlisting">
285 selectivity = selectivity(unique1 < 1000) * selectivity(stringu1 = 'xxx')
286 = 0.100697 * 0.0014559
289 rows = 10000 * 0.0001466
293 Notice that the number of rows estimated to be returned from the bitmap
294 index scan reflects only the condition used with the index; this is
295 important since it affects the cost estimate for the subsequent heap
298 Finally we will examine a query that involves a join:
300 </p><pre class="programlisting">
301 EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
302 WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2;
305 --------------------------------------------------------------------------------------
306 Nested Loop (cost=4.64..456.23 rows=50 width=488)
307 -> Bitmap Heap Scan on tenk1 t1 (cost=4.64..142.17 rows=50 width=244)
308 Recheck Cond: (unique1 < 50)
309 -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.63 rows=50 width=0)
310 Index Cond: (unique1 < 50)
311 -> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.00..6.27 rows=1 width=244)
312 Index Cond: (unique2 = t1.unique2)
315 The restriction on <code class="structname">tenk1</code>,
316 <code class="literal">unique1 < 50</code>,
317 is evaluated before the nested-loop join.
318 This is handled analogously to the previous range example. This time the
319 value 50 falls into the first bucket of the
320 <code class="structfield">unique1</code> histogram:
322 </p><pre class="programlisting">
323 selectivity = (0 + (50 - bucket[1].min)/(bucket[1].max - bucket[1].min))/num_buckets
324 = (0 + (50 - 0)/(993 - 0))/10
327 rows = 10000 * 0.005035
331 The restriction for the join is <code class="literal">t2.unique2 = t1.unique2</code>.
333 our familiar <code class="literal">=</code>, however the selectivity function is
334 obtained from the <code class="structfield">oprjoin</code> column of
335 <code class="structname">pg_operator</code>, and is <code class="function">eqjoinsel</code>.
336 <code class="function">eqjoinsel</code> looks up the statistical information for both
337 <code class="structname">tenk2</code> and <code class="structname">tenk1</code>:
339 </p><pre class="programlisting">
340 SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
341 WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';
343 tablename | null_frac | n_distinct | most_common_vals
344 -----------+-----------+------------+------------------
349 In this case there is no <acronym class="acronym">MCV</acronym> information for
350 <code class="structname">unique2</code> and all the values appear to be
351 unique (n_distinct = -1), so we use an algorithm that relies on the row
352 count estimates for both relations (num_rows, not shown, but "tenk")
353 together with the column null fractions (zero for both):
355 </p><pre class="programlisting">
356 selectivity = (1 - null_frac1) * (1 - null_frac2) / max(num_rows1, num_rows2)
357 = (1 - 0) * (1 - 0) / max(10000, 10000)
361 This is, subtract the null fraction from one for each of the relations,
362 and divide by the row count of the larger relation (this value does get
363 scaled in the non-unique case).
365 that the join is likely to emit is calculated as the cardinality of the
366 Cartesian product of the two inputs, multiplied by the
369 </p><pre class="programlisting">
370 rows = (outer_cardinality * inner_cardinality) * selectivity
371 = (50 * 10000) * 0.0001
375 Had there been MCV lists for the two columns,
376 <code class="function">eqjoinsel</code> would have used direct comparison of the MCV
377 lists to determine the join selectivity within the part of the column
378 populations represented by the MCVs. The estimate for the remainder of the
379 populations follows the same approach shown here.
381 Notice that we showed <code class="literal">inner_cardinality</code> as 10000, that is,
382 the unmodified size of <code class="structname">tenk2</code>. It might appear from
383 inspection of the <code class="command">EXPLAIN</code> output that the estimate of
384 join rows comes from 50 * 1, that is, the number of outer rows times
385 the estimated number of rows obtained by each inner index scan on
386 <code class="structname">tenk2</code>. But this is not the case: the join relation size
387 is estimated before any particular join plan has been considered. If
388 everything is working well then the two ways of estimating the join
389 size will produce about the same answer, but due to round-off error and
390 other factors they sometimes diverge significantly.
392 For those interested in further details, estimation of the size of
393 a table (before any <code class="literal">WHERE</code> clauses) is done in
394 <code class="filename">src/backend/optimizer/util/plancat.c</code>. The generic
395 logic for clause selectivities is in
396 <code class="filename">src/backend/optimizer/path/clausesel.c</code>. The
397 operator-specific selectivity functions are mostly found
398 in <code class="filename">src/backend/utils/adt/selfuncs.c</code>.
399 </p></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="planner-stats-details.html" title="Chapter 69. How the Planner Uses Statistics">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="planner-stats-details.html" title="Chapter 69. How the Planner Uses Statistics">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="multivariate-statistics-examples.html" title="69.2. Multivariate Statistics Examples">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 69. How the Planner Uses Statistics </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 18.0 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 69.2. Multivariate Statistics Examples</td></tr></table></div></body></html>