Fix misestimation of n_distinct for a nearly-unique column with many nulls.

If ANALYZE found no repeated non-null entries in its sample, it set the column's stadistinct value to -1.0, intending to indicate that the entries are all distinct. But what this value actually means is that the number of distinct values is 100% of the table's rowcount, and thus it was overestimating the number of distinct values by however many nulls there are. This could lead to very poor selectivity estimates, as for example in a recent report from Andreas Joseph Krogh. We should discount the stadistinct value by whatever we've estimated the nulls fraction to be. (That is what will happen if we choose to use a negative stadistinct for a column that does have repeated entries, so this code path was just inconsistent.) In addition to fixing the stadistinct entries stored by several different ANALYZE code paths, adjust the logic where get_variable_numdistinct() forces an "all distinct" estimate on the basis of finding a relevant unique index. Unique indexes don't reject nulls, so there's no reason to assume that the null fraction doesn't apply. Back-patch to all supported branches. Back-patching is a bit of a judgment call, but this problem seems to affect only a few users (else we'd have identified it long ago), and it's bad enough when it does happen that destabilizing plan choices in a worse direction seems unlikely. Patch by me, with documentation wording suggested by Dean Rasheed Report: <VisenaEmail.26.df42f82acae38a58.156463942b8@tc7-visena> Discussion: <16143.1470350371@sss.pgh.pa.us>
author: Tom Lane 2016-08-07 22:52:02 +0000
committer: Tom Lane 2016-08-07 22:52:02 +0000
commit: 127d73009a6d8ea7c17c6fe55458a2bfcf6a1593 (patch)
tree: 5c29b1cf892e44606a39b14d3230dbe7cf6d2349 /src/include
parent: 3e40d9227f98088afcaf4f00c7025c465dad3551 (diff)
1 files changed, 8 insertions, 7 deletions
diff --git a/src/include/catalog/pg_statistic.h b/src/include/catalog/pg_statistic.h
index 0d4f69a8e98..fc5d120e1e5 100644
--- a/src/include/catalog/pg_statistic.h
+++ b/src/include/catalog/pg_statistic.h
@@ -57,13 +57,14 @@ CATALOG(pg_statistic,2619) BKI_WITHOUT_OIDS
 	 *		> 0		actual number of distinct values
 	 *		< 0		negative of multiplier for number of rows
 	 * The special negative case allows us to cope with columns that are
-	 * unique (stadistinct = -1) or nearly so (for example, a column in
-	 * which values appear about twice on the average could be represented
-	 * by stadistinct = -0.5).  Because the number-of-rows statistic in
-	 * pg_class may be updated more frequently than pg_statistic is, it's
-	 * important to be able to describe such situations as a multiple of
-	 * the number of rows, rather than a fixed number of distinct values.
-	 * But in other cases a fixed number is correct (eg, a boolean column).
+	 * unique (stadistinct = -1) or nearly so (for example, a column in which
+	 * non-null values appear about twice on the average could be represented
+	 * by stadistinct = -0.5 if there are no nulls, or -0.4 if 20% of the
+	 * column is nulls).  Because the number-of-rows statistic in pg_class may
+	 * be updated more frequently than pg_statistic is, it's important to be
+	 * able to describe such situations as a multiple of the number of rows,
+	 * rather than a fixed number of distinct values.  But in other cases a
+	 * fixed number is correct (eg, a boolean column).
 	 * ----------------
 	 */
 	float4		stadistinct;
author	Tom Lane	2016-08-07 22:52:02 +0000
committer	Tom Lane	2016-08-07 22:52:02 +0000
commit	127d73009a6d8ea7c17c6fe55458a2bfcf6a1593 (patch)
tree	5c29b1cf892e44606a39b14d3230dbe7cf6d2349 /src/include
parent	3e40d9227f98088afcaf4f00c7025c465dad3551 (diff)