From 7d08ce286cd5854d58152e428c28636a616bdc42 Mon Sep 17 00:00:00 2001 From: Tom Lane Date: Wed, 13 Sep 2017 11:12:39 -0400 Subject: Distinguish selectivity of < from <= and > from >=. Historically, the selectivity functions have simply not distinguished < from <=, or > from >=, arguing that the fraction of the population that satisfies the "=" aspect can be considered to be vanishingly small, if the comparison value isn't any of the most-common-values for the variable. (If it is, the code path that executes the operator against each MCV will take care of things properly.) But that isn't really true unless we're dealing with a continuum of variable values, and in practice we seldom are. If "x = const" would estimate a nonzero number of rows for a given const value, then it follows that we ought to estimate different numbers of rows for "x < const" and "x <= const", even if the const is not one of the MCVs. Handling this more honestly makes a significant difference in edge cases, such as the estimate for a tight range (x BETWEEN y AND z where y and z are close together). Hence, split scalarltsel into scalarltsel/scalarlesel, and similarly split scalargtsel into scalargtsel/scalargesel. Adjust <= and >= operator definitions to reference the new selectivity functions. Improve the core ineq_histogram_selectivity() function to make a correction for equality. (Along the way, I learned quite a bit about exactly why that function gives good answers, which I tried to memorialize in improved comments.) The corresponding join selectivity functions were, and remain, just stubs. But I chose to split them similarly, to avoid confusion and to prevent the need for doing this exercise again if someone ever makes them less stubby. In passing, change ineq_histogram_selectivity's clamp for extreme probability estimates so that it varies depending on the histogram size, instead of being hardwired at 0.0001. With the default histogram size of 100 entries, you still get the old clamp value, but bigger histograms should allow us to put more faith in edge values. Tom Lane, reviewed by Aleksander Alekseev and Kuntal Ghosh Discussion: https://postgr.es/m/12232.1499140410@sss.pgh.pa.us --- doc/src/sgml/xindex.sgml | 3 +-- doc/src/sgml/xoper.sgml | 35 +++++++++++++++-------------------- 2 files changed, 16 insertions(+), 22 deletions(-) (limited to 'doc/src') diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml index 745b4d5619a..b951a58e0ab 100644 --- a/doc/src/sgml/xindex.sgml +++ b/doc/src/sgml/xindex.sgml @@ -801,8 +801,7 @@ CREATE OPERATOR < ( It is important to specify the correct commutator and negator operators, as well as suitable restriction and join selectivity functions, otherwise the optimizer will be unable to make effective - use of the index. Note that the less-than, equal, and - greater-than cases should use different selectivity functions. + use of the index. diff --git a/doc/src/sgml/xoper.sgml b/doc/src/sgml/xoper.sgml index 8568e21216b..d484d80105e 100644 --- a/doc/src/sgml/xoper.sgml +++ b/doc/src/sgml/xoper.sgml @@ -242,20 +242,11 @@ column OP constant eqsel for = neqsel for <> - scalarltsel for < or <= - scalargtsel for > or >= - - It might seem a little odd that these are the categories, but they - make sense if you think about it. = will typically accept only - a small fraction of the rows in a table; <> will typically reject - only a small fraction. < will accept a fraction that depends on - where the given constant falls in the range of values for that table - column (which, it just so happens, is information collected by - ANALYZE and made available to the selectivity estimator). - <= will accept a slightly larger fraction than < for the same - comparison constant, but they're close enough to not be worth - distinguishing, especially since we're not likely to do better than a - rough guess anyhow. Similar remarks apply to > and >=. + scalarltsel for < + scalarlesel for <= + scalargtsel for > + scalargesel for >= + @@ -267,10 +258,12 @@ column OP constant - You can use scalarltsel and scalargtsel for comparisons on data types that - have some sensible means of being converted into numeric scalars for - range comparisons. If possible, add the data type to those understood - by the function convert_to_scalar() in src/backend/utils/adt/selfuncs.c. + You can use scalarltsel, scalarlesel, + scalargtsel and scalargesel for comparisons on + data types that have some sensible means of being converted into numeric + scalars for range comparisons. If possible, add the data type to those + understood by the function convert_to_scalar() in + src/backend/utils/adt/selfuncs.c. (Eventually, this function should be replaced by per-data-type functions identified through a column of the pg_type system catalog; but that hasn't happened yet.) If you do not do this, things will still work, but the optimizer's @@ -310,8 +303,10 @@ table1.column1 OP table2.column2 eqjoinsel for = neqjoinsel for <> - scalarltjoinsel for < or <= - scalargtjoinsel for > or >= + scalarltjoinsel for < + scalarlejoinsel for <= + scalargtjoinsel for > + scalargejoinsel for >= areajoinsel for 2D area-based comparisons positionjoinsel for 2D position-based comparisons contjoinsel for 2D containment-based comparisons -- cgit v1.2.3