Age | Commit message (Collapse) | Author |
|
Word boundaries are based on whether a character is alphanumeric or
not. For the PG_UNICODE_FAST collation, alphanumeric includes
non-ASCII digits; whereas for the PG_C_UTF8 collation, it only
includes digits 0-9. Pass down the right information from the
pg_locale_t into initcap_wbnext to differentiate the behavior.
Reported-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250417135841.33.nmisch@google.com
|
|
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://www.postgresql.org/message-id/flat/146349e4-4687-4321-91af-f235572490a8@eisentraut.org
|
|
Commit 4e7f62bc386 added a new input file to a script but didn't
update the comment listing the input files.
|
|
The addition of SpecialCasing.txt by commit 286a365b9c2 was not added
to the make target dependencies, so the invoked script would fail
because the required file wasn't downloaded first. (The meson version
appears to work correctly.)
|
|
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/93731.1742310701@sss.pgh.pa.us
|
|
Commit 27bdec06841 uses a loop variable that is not strictly local to
the loop. Perlcritic disapproves, and there's really no reason as the
variable is not used outside the loop.
Per buildfarm animals koel and crake.
|
|
Improve performance and reduce table sizes for case mapping.
The main case mapping table stores only 16-bit offsets, which can be
used to look up the mapped code point in any of the case tables (fold,
lower, upper, or title case). Simple case pairs point to the same
offsets.
Generate a function in generate-unicode_case_table.pl that consists of
a nested branches to test for specific codepoint ranges that determine
the offset in the main table.
Other approaches were considered, such as representing these ranges as
another structure (rather than branches in a generated function), or a
different approach such as a radix tree, or perfect hashing. The
author implemented and tested these alternatives and settled on the
generated branches.
Author: Alexander Borisov <lex.borisov@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/7cac7e66-9a3b-4e3f-a997-42aa0c401f80%40gmail.com
|
|
Expand case mapping tables to include entries for case folding, which
are parsed from CaseFolding.txt.
Discussion: https://postgr.es/m/a1886ddfcd8f60cb3e905c93009b646b4cfb74c5.camel%40j-davis.com
|
|
Generate tables from Unicode SpecialCasing.txt to support more
sophisticated case mapping behavior:
* support case mappings to multiple codepoints, such as "ß"
uppercasing to "SS"
* support conditional case mappings, such as the "final sigma"
* support titlecase variants, such as "dž" uppercasing to "DŽ" but
titlecasing to "Dž"
Discussion: https://postgr.es/m/ddfd67928818f138f51635712529bc5e1d25e4e7.camel@j-davis.com
Discussion: https://postgr.es/m/27bb0e52-801d-4f73-a0a4-02cfdd4a9ada@eisentraut.org
Reviewed-by: Peter Eisentraut, Daniel Verite
|
|
Backpatch-through: 13
|
|
Most came in during the 17 cycle, so backpatch there. Some
(particularly reorderbuffer.h) are very old, but backpatching doesn't
seem useful.
Like commits c9d297751959, c4f113e8fef9.
|
|
The test programs in src/common/unicode/ (case_test, category_test,
norm_test), don't build with meson if the nls option is enabled,
because a libintl dependency is missing. Fix that. (The makefiles
are ok.)
Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://www.postgresql.org/message-id/flat/52db1d2b-4b96-473e-b323-a4b16a950fba%40eisentraut.org
|
|
When running the 'update-unicode' build target, generate files that
conform to pgindent whitespace rules.
|
|
Apply c5385929593 to new Perl files that had missed the note.
|
|
A few patches that were written <= 2023 and committed in 2024 still
contained 2023 copyright year. Fix that.
Discussion: https://postgr.es/m/CAApHDvr5egyW3FmHbAg-Uq2p_Aizwco1Zjs6Vbq18KqN64-hRA@mail.gmail.com
|
|
for new downloaded files and new build results.
|
|
|
|
Check source length before checking for NUL terminator to avoid
reading one byte past the string end. Also fix unreachable bug when
caller does not expect NUL-terminated result.
Add unit test coverage of convert_case() in case_test.c, which makes
it easier to reproduce the valgrind failure.
Discussion: https://postgr.es/m/7a9fd36d-7a38-4dc2-e676-fc939491a95a@gmail.com
Reported-by: Alexander Lakhin
|
|
Change the test description to include the case mapping
test. Oversight in 5c40364dd6.
|
|
Implements Unicode simple case mapping, in which all code points map
to exactly one other code point unconditionally.
These tables are generated from UnicodeData.txt, which is already
being used by other infrastructure in src/common/unicode. The tables
are checked into the source tree, so they only need to be regenerated
when we update the Unicode version.
In preparation for the builtin collation provider, and possibly useful
for other callers.
Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel%40j-davis.com
Reviewed-by: Peter Eisentraut, Daniel Verite, Jeremy Schneider
|
|
Provide functions to test for Unicode properties, such as Alphabetic
or Cased. These functions use tables derived from Unicode data files,
similar to the tables for Unicode normalization or general category,
and those tables can be updated with the 'update-unicode' build
target.
Use Unicode properties to provide functions to test for regex
character classes, like 'punct' or 'alnum'.
Infrastructure in preparation for a builtin collation provider, and
may also be useful for other callers.
Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel%40j-davis.com
Reviewed-by: Daniel Verite, Peter Eisentraut, Jeremy Schneider
|
|
In preparation for adding more Unicode tables.
Discussion: https://postgr.es/m/63cd8625-68fa-4760-844a-6b7f643336f2@ardentperf.com
Reviewed-by: Jeremy Schneider
|
|
Reported-by: Michael Paquier
Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz
Backpatch-through: 12
|
|
There are a lot of Perl scripts in the tree, mostly code generation
and TAP tests. Occasionally, these scripts produce warnings. These
are probably always mistakes on the developer side (true positives).
Typical examples are warnings from genbki.pl or related when you make
a mess in the catalog files during development, or warnings from tests
when they massage a config file that looks different on different
hosts, or mistakes during merges (e.g., duplicate subroutine
definitions), or just mistakes that weren't noticed because there is a
lot of output in a verbose build.
This changes all warnings into fatal errors, by replacing
use warnings;
by
use warnings FATAL => 'all';
in all Perl files.
Discussion: https://www.postgresql.org/message-id/flat/06f899fd-1826-05ab-42d6-adeb1fd5e200%40eisentraut.org
|
|
Missing entries can implicitly be considered "unassigned".
Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel@j-davis.com
|
|
A PostgreSQL release tarball contains a number of prebuilt files, in
particular files produced by bison, flex, perl, and well as html and
man documentation. We have done this consistent with established
practice at the time to not require these tools for building from a
tarball. Some of these tools were hard to get, or get the right
version of, from time to time, and shipping the prebuilt output was a
convenience to users.
Now this has at least two problems:
One, we have to make the build system(s) work in two modes: Building
from a git checkout and building from a tarball. This is pretty
complicated, but it works so far for autoconf/make. It does not
currently work for meson; you can currently only build with meson from
a git checkout. Making meson builds work from a tarball seems very
difficult or impossible. One particular problem is that since meson
requires a separate build directory, we cannot make the build update
files like gram.h in the source tree. So if you were to build from a
tarball and update gram.y, you will have a gram.h in the source tree
and one in the build tree, but the way things work is that the
compiler will always use the one in the source tree. So you cannot,
for example, make any gram.y changes when building from a tarball.
This seems impossible to fix in a non-horrible way.
Second, there is increased interest nowadays in precisely tracking the
origin of software. We can reasonably track contributions into the
git tree, and users can reasonably track the path from a tarball to
packages and downloads and installs. But what happens between the git
tree and the tarball is obscure and in some cases non-reproducible.
The solution for both of these issues is to get rid of the step that
adds prebuilt files to the tarball. The tarball now only contains
what is in the git tree (*). Getting the additional build
dependencies is no longer a problem nowadays, and the complications to
keep these dual build modes working are significant. And of course we
want to get the meson build system working universally.
This commit removes the make distprep target altogether. The make
dist target continues to do its job, it just doesn't call distprep
anymore.
(*) - The tarball also contains the INSTALL file that is built at make
dist time, but not by distprep. This is unchanged for now.
The make maintainer-clean target, whose job it is to remove the
prebuilt files in addition to what make distclean does, is now just an
alias to make distprep. (In practice, it is probably obsolete given
that git clean is available.)
The following programs are now hard build requirements in configure
(they were already required by meson.build):
- bison
- flex
- perl
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/e07408d9-e5f2-d9fd-5672-f53354e9305e@eisentraut.org
|
|
Introduce unicode_version(), icu_unicode_version(), and
unicode_assigned().
The latter requires introducing a new lookup table for the Unicode
General Category, which is generated along with the other Unicode
lookup tables.
Discussion: https://postgr.es/m/CA+TgmoYzYR-yhU6k1XFCADeyj=Oyz2PkVsa3iKv+keM8wp-F_A@mail.gmail.com
Reviewed-by: Peter Eisentraut
|
|
|
|
Between Unicode 15.0.0 and 15.1.0, the whitespace in
EastAsianWidth.txt has changed a bit, such as from
0020;Na # Zs SPACE
to
0020 ; Na # Zs SPACE
with space around the semicolon. Adjust the script to be able to
parse that.
|
|
Eventually it is likely worth trying to deal with this in a more expansive
way, by generating dependency files generated within the scripts. But it's not
entirely obvious how to do that in perl and is work more suitable for 17
anyway.
Reported-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: Tristan Partin <tristan@neon.tech>
Discussion: https://postgr.es/m/87v8g7s6bf.fsf@wibble.ilmari.org
|
|
Run pgindent, pgperltidy, and reformat-dat-files.
This set of diffs is a bit larger than typical. We've updated to
pg_bsd_indent 2.1.2, which properly indents variable declarations that
have multi-line initialization expressions (the continuation lines are
now indented one tab stop). We've also updated to perltidy version
20230309 and changed some of its settings, which reduces its desire to
add whitespace to lines to make assignments etc. line up. Going
forward, that should make for fewer random-seeming changes to existing
code.
Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql
|
|
We already didn't use touch (some earlier version of the meson build did ),
and cp is only used for updating unicode files. The latter already depends on
the optional availability of 'wget', so doing the same for 'cp' makes sense.
Eventually we probably want a portable command for updating source code as
part of a target, but for now...
Reported-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/70e96c34-64ee-e549-8c4a-f91a7a668804@dunslane.net
|
|
Backpatch-through: 11
|
|
Discussion: https://postgr.es/m/222b43a5-2fb3-2c1b-9cd0-375d376c8246@dunslane.net
|
|
Autoconf is showing its age, fewer and fewer contributors know how to wrangle
it. Recursive make has a lot of hard to resolve dependency issues and slow
incremental rebuilds. Our home-grown MSVC build system is hard to maintain for
developers not using Windows and runs tests serially. While these and other
issues could individually be addressed with incremental improvements, together
they seem best addressed by moving to a more modern build system.
After evaluating different build system choices, we chose to use meson, to a
good degree based on the adoption by other open source projects.
We decided that it's more realistic to commit a relatively early version of
the new build system and mature it in tree.
This commit adds an initial version of a meson based build system. It supports
building postgres on at least AIX, FreeBSD, Linux, macOS, NetBSD, OpenBSD,
Solaris and Windows (however only gcc is supported on aix, solaris). For
Windows/MSVC postgres can now be built with ninja (faster, particularly for
incremental builds) and msbuild (supporting the visual studio GUI, but
building slower).
Several aspects (e.g. Windows rc file generation, PGXS compatibility, LLVM
bitcode generation, documentation adjustments) are done in subsequent commits
requiring further review. Other aspects (e.g. not installing test-only
extensions) are not yet addressed.
When building on Windows with msbuild, builds are slower when using a visual
studio version older than 2019, because those versions do not support
MultiToolTask, required by meson for intra-target parallelism.
The plan is to remove the MSVC specific build system in src/tools/msvc soon
after reaching feature parity. However, we're not planning to remove the
autoconf/make build system in the near future. Likely we're going to keep at
least the parts required for PGXS to keep working around until all supported
versions build with meson.
Some initial help for postgres developers is at
https://wiki.postgresql.org/wiki/Meson
With contributions from Thomas Munro, John Naylor, Stone Tickle and others.
Author: Andres Freund <andres@anarazel.de>
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Author: Peter Eisentraut <peter@eisentraut.org>
Reviewed-By: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Discussion: https://postgr.es/m/20211012083721.hvixq4pnh2pixr3j@alap3.anarazel.de
|
|
Commit d8594d123 updated the list of non-spacing codepoints used
for calculating display width, but in doing so inadvertently removed
some, since the script used for that commit only considered combining
characters.
For complete coverage for zero-width characters, include codepoints in
the category Cf (Format). To reflect the wider purpose, also rename files
and update comments that referred specifically to combining characters.
Some of these ranges have been missing since v12, but due to lack of
field complaints it was determined not important enough to justify adding
special-case logic the backbranches.
Kyotaro Horiguchi
Report by Pavel Stehule
Discussion: https://www.postgresql.org/message-id/flat/CAFj8pRBE8yvpQ0FSkPCoe0Ny1jAAsAQ6j3qMgVwWvkqAoaaNmQ%40mail.gmail.com
|
|
This is in preparation for building postgres with meson / ninja.
When building with meson, commands are run at the root of the build tree. Add
an option to put build output into the appropriate place.
Author: Andres Freund <andres@anarazel.de>
Author: Peter Eisentraut <peter@eisentraut.org>
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/5e216522-ba3c-f0e6-7f97-5276d0270029@enterprisedb.com
|
|
Author: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/616c6873-83b5-85c0-93cb-548977c39c60@enterprisedb.com
|
|
Backpatch-through: 10
|
|
The former limit was perhaps a carryover from an older hand-coded
table. Since commit bab982161 we have enough space in mbinterval to
store larger codepoints, so collect all combining characters.
Discussion: https://www.postgresql.org/message-id/49ad1fa0-174e-c901-b14c-c484b60907f1%40enterprisedb.com
|
|
The hardcoded "wide character" set in ucs_wcwidth() was last updated
around the Unicode 5.0 era. This led to misalignment when printing
emojis and other codepoints that have since been designated
wide or full-width.
To fix and keep up to date, extend update-unicode to download the list
of wide and full-width codepoints from the offical sources.
In passing, remove some comments about non-spacing characters that
haven't been accurate since we removed the former hardcoded logic.
Jacob Champion
Reported and reviewed by Pavel Stehule
Discussion: https://www.postgresql.org/message-id/flat/CAFj8pRCeX21O69YHxmykYySYyprZAqrKWWg0KoGKdjgqcGyygg@mail.gmail.com
|
|
This reverts commit eb0d0d2c7300c9c5c22b35975c11265aa4becc84.
After I had committed eb0d0d2c7 and 78ab944cd, I decided to add
a sanity check for a "can't happen" scenario just to be cautious.
It turned out that it already happened in the official Unicode source
data, namely that a character can be both wide and a combining
character. This fact renders the aforementioned commits unnecessary,
so revert both of them.
Discussion: https://www.postgresql.org/message-id/CAFBsxsH5ejH4-1xaTLpSK8vWoK1m6fA1JBtTM6jmBsLfmDki1g%40mail.gmail.com
|
|
This reverts commit 78ab944cd4b9977732becd9d0bc83223b88af9a2.
After I had committed eb0d0d2c7 and 78ab944cd, I decided to add
a sanity check for a "can't happen" scenario just to be cautious.
It turned out that it already happened in the official Unicode source
data, namely that a character can be both wide and a combining
character. This fact renders the aforementioned commits unnecessary,
so revert both of them.
Discussion:
https://www.postgresql.org/message-id/CAFBsxsH5ejH4-1xaTLpSK8vWoK1m6fA1JBtTM6jmBsLfmDki1g%40mail.gmail.com
|
|
Add a width field to mbinterval and have mbbisearch return a
pointer to the found range rather than just bool for success.
A future commit will add another width besides zero, and this
will allow that to use the same search.
Reviewed by Jacob Champion
Discussion: https://www.postgresql.org/message-id/CAFBsxsGOCpzV7c-f3a8ADsA1n4uZ%3D8puCctQp%2Bx7W0vgkv%3Dw%2Bg%40mail.gmail.com
|
|
No functional changes. A future commit will use this table for
other purposes besides combining characters.
|
|
Backpatch-through: 9.5
|
|
The table generation script would incorrectly complain in the
recomposition sorting when matching code points. This would not have
caused the generation of an incorrect table. Note that this condition
is not reachable yet, but could have been reached with future updates.
pg_bswap.h does not need to be included in the frontend.x
Author: John Naylor
Discussion: https://postgr.es/m/CAFBsxsGWmExpvv=61vtDKCs7+kBbhkwBDL2Ph9CacziFKnV_yw@mail.gmail.com
|
|
This replaces the existing binary search with two perfect hash functions
for the composition and the decomposition in the backend code, at the
cost of slightly-larger binaries there (35kB in libpgcommon_srv.a). Per
the measurements done, this improves the speed of the recomposition and
decomposition by up to 30~40 times for the NFC and NFKC conversions,
while all other operations get at least 40% faster. This is not as
"good" as what libicu has, but it closes the gap a lot as per the
feedback from Daniel Verite.
The decomposition table remains the same, getting used for the binary
search in the frontend code, where we care more about the size of the
libraries like libpq over performance as this gets involved only in code
paths related to the SCRAM authentication. In consequence, note that
the perfect hash function for the recomposition needs to use a new
inverse lookup array back to to the existing decomposition table.
The size of all frontend deliverables remains unchanged, even with
--enable-debug, including libpq.
Author: John Naylor
Reviewed-by: Michael Paquier, Tom Lane
Discussion: https://postgr.es/m/CAFBsxsHUuMFCt6-pU+oG-F1==CmEp8wR+O+bRouXWu6i8kXuqA@mail.gmail.com
|
|
This makes the normalization quick check about 30% faster for NFC and
50% faster for NFKC than the binary search used previously. The hash
lookup reuses the existing array of bit fields used for the binary
search to get the quick check property and is generated as part of "make
update-unicode" in src/common/unicode/.
Author: John Naylor
Reviewed-by: Mark Dilger, Michael Paquier
Discussion: https://postgr.es/m/CACPNZCt4fbJ0_bGrN5QPt34N4whv=mszM0LMVQdoa2rC9UMRXA@mail.gmail.com
|
|
Includes some manual cleanup of places that pgindent messed up,
most of which weren't per project style anyway.
Notably, it seems some people didn't absorb the style rules of
commit c9d297751, because there were a bunch of new occurrences
of function calls with a newline just after the left paren, all
with faulty expectations about how the rest of the call would get
indented.
|