git.postgresql.org Git - postgresql.git/commit

author	Tom Lane <tgl@sss.pgh.pa.us>
	Tue, 1 Dec 2009 21:00:24 +0000 (21:00 +0000)
committer	Tom Lane <tgl@sss.pgh.pa.us>
	Tue, 1 Dec 2009 21:00:24 +0000 (21:00 +0000)
commit	0d32342501f2a562bc57156dc92d59a0624be4a6
tree	9039a0f5bdc634c1a7dfa99371160e51e1759168
parent	ef51395e24c7452a9a50e3576b52fb64602f8cad

Teach the regular expression functions to do case-insensitive matching and
locale-dependent character classification properly when the database encoding
is UTF8.

The previous coding worked okay in single-byte encodings, or in any case for
ASCII characters, but failed entirely on multibyte characters.  The fix
assumes that the <wctype.h> functions use Unicode code points as the wchar
representation for Unicode, i.e., wchar matches pg_wchar.
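
As a rough illustration of the approach described above, the following is a
minimal sketch and not the committed code. The names sketch_wchar,
sketch_wc_isalpha, sketch_wc_tolower and the is_utf8 flag are hypothetical
stand-ins for pg_wchar and a database-encoding check. It assumes, as the fix
does, that the platform's wchar_t values for Unicode are Unicode code points.

```c
#include <ctype.h>
#include <limits.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical stand-in for pg_wchar: one character code per value. */
typedef unsigned int sketch_wchar;

/*
 * Report whether c is alphabetic.  is_utf8 is a hypothetical flag standing
 * in for "the database encoding is UTF8".
 */
static int
sketch_wc_isalpha(sketch_wchar c, int is_utf8)
{
	if (is_utf8)
	{
		/* Use <wctype.h> only if wchar_t can represent the code point. */
		if (sizeof(wchar_t) >= 4 || c <= (sketch_wchar) 0xFFFF)
			return iswalpha((wint_t) c) != 0;
	}
	/* Fallback: only single-byte values can be classified. */
	return c <= (sketch_wchar) UCHAR_MAX && isalpha((unsigned char) c) != 0;
}

/* Fold c to lower case, following the same dispatch as above. */
static sketch_wchar
sketch_wc_tolower(sketch_wchar c, int is_utf8)
{
	if (is_utf8)
	{
		if (sizeof(wchar_t) >= 4 || c <= (sketch_wchar) 0xFFFF)
			return (sketch_wchar) towlower((wint_t) c);
	}
	if (c <= (sketch_wchar) UCHAR_MAX)
		return (sketch_wchar) tolower((unsigned char) c);
	/* Wider characters are returned unchanged in this sketch. */
	return c;
}
```

Results for non-ASCII input depend on the active LC_CTYPE locale. Under a
UTF-8 locale, and only where the code-point assumption holds, a call such as
sketch_wc_tolower(0x00C9, 1) would be expected to fold an accented capital
to its lower-case form.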

This is only a partial solution, since we're still stupid about non-ASCII
characters in multibyte encodings other than UTF8.  The practical effect
of that is limited, however, since those cases are generally Far Eastern
glyphs for which concepts like case-folding don't apply anyway.  Certainly
all or nearly all of the field reports of problems have been about UTF8.
A more general solution would require switching to the platform's wchar
representation for all regex operations, which is possible but would have
substantial disadvantages.  Let's try this and see if it's sufficient in
practice.

src/backend/regex/regc_locale.c
src/include/regex/regcustom.h