path: root/src/include/mb
28 hours ago  Remove incorrect declarations in pg_wchar.h.  (Jeff Davis)

Oversight in commit 9acae56ce0.

Discussion: https://postgr.es/m/541F240E-94AD-4D65-9794-7D6C316BC3FF@gmail.com

2025-10-29  Use C11 char16_t and char32_t for Unicode code points.  (Jeff Davis)

Reviewed-by: Tatsuo Ishii <ishii@postgresql.org>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/bedcc93d06203dfd89815b10f815ca2de8626e85.camel%40j-davis.com

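As a minimal C11 sketch of the premise (the helper below is invented for illustration, not part of the commit): a code point travels as char32_t rather than a bare unsigned int.

    #include <stdio.h>
    #include <uchar.h>      /* C11: char16_t, char32_t */

    /* A code point is held in a char32_t; char16_t would hold UTF-16 units. */
    static char32_t
    ascii_upper_cp(char32_t cp)
    {
        return (cp >= U'a' && cp <= U'z') ? cp - U'a' + U'A' : cp;
    }

    int
    main(void)
    {
        printf("U+%04X\n", (unsigned int) ascii_upper_cp(U'q'));    /* prints U+0051 */
        return 0;
    }
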
2025-05-05  With GB18030, prevent SIGSEGV from reading past end of allocation.  (Noah Misch)

With GB18030 as source encoding, applications could crash the server via SQL functions convert() or convert_from(). Applications themselves could crash after passing unterminated GB18030 input to libpq functions PQescapeLiteral(), PQescapeIdentifier(), PQescapeStringConn(), or PQescapeString(). Extension code could crash by passing unterminated GB18030 input to jsonapi.h functions. All those functions have been intended to handle untrusted, unterminated input safely.

A crash required allocating the input such that the last byte of the allocation was the last byte of a virtual memory page. Some malloc() implementations take measures against that, making the SIGSEGV hard to reach. Back-patch to v13 (all supported versions).

Author: Noah Misch <noah@leadboat.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Backpatch-through: 13
Security: CVE-2025-4207

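The shape of the hazard and the fix is easy to sketch (illustrative only, not the committed patch): a verifier for GB18030's four-byte sequences must check the remaining length before touching s[1..3], otherwise a sequence truncated at the very end of an allocation is read past its last byte.

    #include <stdbool.h>

    /* Valid GB18030 four-byte sequences look like:
     *   0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39
     * Checking "end - s >= 4" first is what keeps a truncated sequence at
     * the end of an allocation from being read past its last byte.
     */
    static bool
    gb18030_4byte_ok(const unsigned char *s, const unsigned char *end)
    {
        if (end - s < 4)
            return false;
        return s[0] >= 0x81 && s[0] <= 0xFE &&
               s[1] >= 0x30 && s[1] <= 0x39 &&
               s[2] >= 0x81 && s[2] <= 0xFE &&
               s[3] >= 0x30 && s[3] <= 0x39;
    }
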
2025-03-13  pg_noreturn to replace pg_attribute_noreturn()  (Peter Eisentraut)

We want to support a "noreturn" decoration on more compilers besides just GCC-compatible ones, but for that we need to move the decoration in front of the function declaration instead of either behind it or wherever, which is the current style afforded by GCC-style attributes. Also rename the macro to "pg_noreturn" to be similar to the C11 standard "noreturn".

pg_noreturn is now supported on all compilers that support C11 (using _Noreturn), as well as GCC-compatible ones (using __attribute__, as before), as well as MSVC (using __declspec). (When PostgreSQL requires C11, the latter two variants can be dropped.)

Now, all supported compilers effectively support pg_noreturn, so the extra code for !HAVE_PG_ATTRIBUTE_NORETURN can be dropped.

This also fixes a possible problem if third-party code includes stdnoreturn.h, because then the current definition of

    #define pg_attribute_noreturn() __attribute__((noreturn))

would cause an error.

Note that the C standard does not support a noreturn attribute on function pointer types. So we have to drop these here. There are only two instances at this time, so it's not a big loss. In one case, we can make up for it by adding the pg_noreturn to a wrapper function and adding a pg_unreachable(); in the other case, the latter was already done before.

Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/pxr5b3z7jmkpenssra5zroxi7qzzp6eswuggokw64axmdixpnk@zbwxuq7gbbcw

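A sketch of the three-way dispatch described above; the exact conditionals in the real header may differ.

    /* C11 compilers get _Noreturn; GCC-compatible ones the attribute;
     * MSVC the __declspec.  All three forms can precede the declaration.
     */
    #if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
    #define pg_noreturn _Noreturn
    #elif defined(__GNUC__)
    #define pg_noreturn __attribute__((noreturn))
    #elif defined(_MSC_VER)
    #define pg_noreturn __declspec(noreturn)
    #else
    #define pg_noreturn             /* per the commit, shouldn't be reached */
    #endif

    pg_noreturn extern void proc_exit(int code);    /* decoration now comes first */
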
2025-02-10  Add pg_encoding_set_invalid()  (Andres Freund)

There are cases where we cannot / do not want to error out for invalidly encoded input. In such cases it can be useful to replace e.g. an incomplete multi-byte character with bytes that will trigger an error when getting validated as part of a larger string.

Unfortunately, until now, for some encodings no such sequence existed. For those encodings this commit removes one previously accepted input combination - we consider that to be ok, as the chosen bytes are outside of the valid ranges for the encodings; we just previously failed to detect that.

As we cannot add a new field to pg_wchar_table without breaking ABI, this is implemented "in-line" in the newly added function.

Author: Noah Misch <noah@leadboat.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Backpatch-through: 13
Security: CVE-2025-1094

2025-01-01  Update copyright for 2025  (Bruce Momjian)

Backpatch-through: 13

2024-03-20  Inline basic UTF-8 functions.  (Jeff Davis)

Shows a measurable speedup when processing UTF-8 data, such as with the new builtin collation provider.

Discussion: https://postgr.es/m/163f4e2190cdf67f67016044e503c5004547e5a9.camel@j-davis.com
Reviewed-by: Peter Eisentraut

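The kind of small, hot helper this refers to, as a hedged sketch (standard UTF-8 first-byte rules, not necessarily the exact pg_utf_mblen() coding):

    /* UTF-8 sequence length implied by the first byte.  Inlining this
     * avoids a function call per character in tight validation loops.
     */
    static inline int
    utf8_seq_len(unsigned char first)
    {
        if (first < 0x80)
            return 1;           /* 0xxxxxxx: ASCII */
        if (first < 0xC0)
            return 1;           /* 10xxxxxx: continuation byte, invalid as a start */
        if (first < 0xE0)
            return 2;           /* 110xxxxx */
        if (first < 0xF0)
            return 3;           /* 1110xxxx */
        return 4;               /* 11110xxx */
    }
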
2024-03-07  Unicode case mapping tables and functions.  (Jeff Davis)

Implements Unicode simple case mapping, in which all code points map to exactly one other code point unconditionally.

These tables are generated from UnicodeData.txt, which is already being used by other infrastructure in src/common/unicode. The tables are checked into the source tree, so they only need to be regenerated when we update the Unicode version.

In preparation for the builtin collation provider, and possibly useful for other callers.

Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel%40j-davis.com
Reviewed-by: Peter Eisentraut, Daniel Verite, Jeremy Schneider

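A toy sketch of simple case mapping (the table layout and names are invented; the three mappings themselves are real UnicodeData.txt entries):

    #include <stdint.h>
    #include <stdlib.h>

    /* Each entry maps one code point to exactly one lowercase code point,
     * unconditionally.  A generated table would cover all mapped code points,
     * sorted by cp so bsearch() works.
     */
    typedef struct
    {
        uint32_t    cp;
        uint32_t    simple_lower;
    } CaseMapEntry;

    static const CaseMapEntry casemap[] = {
        {0x0041, 0x0061},       /* A -> a */
        {0x00C0, 0x00E0},       /* À -> à */
        {0x0130, 0x0069},       /* İ -> i (simple mapping; the full mapping differs) */
    };

    static int
    casemap_cmp(const void *key, const void *elem)
    {
        uint32_t    cp = *(const uint32_t *) key;
        const CaseMapEntry *e = elem;

        return (cp > e->cp) - (cp < e->cp);
    }

    static uint32_t
    to_simple_lower(uint32_t cp)
    {
        const CaseMapEntry *e = bsearch(&cp, casemap,
                                        sizeof(casemap) / sizeof(casemap[0]),
                                        sizeof(casemap[0]), casemap_cmp);

        return e ? e->simple_lower : cp;    /* unmapped code points map to themselves */
    }
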
2024-03-01  Simplify pg_enc2gettext_tbl[] with C99-designated initializer syntax  (Michael Paquier)

This commit switches pg_enc2gettext_tbl[] in encnames.c to use C99-designated initializer syntax. pg_bind_textdomain_codeset() is simplified so that it is possible to do a direct lookup in the gettext() array with a value of the enum pg_enc, rather than looping through all its elements, as long as the encoding value provided by GetDatabaseEncoding() is in the correct range of supported encoding values. Note that PG_MULE_INTERNAL gains a value in the array, pointing to NULL.

Author: Jelte Fennema-Nio
Discussion: https://postgr.es/m/CAGECzQT3caUbcCcszNewCCmMbCuyP7XNAm60J3ybd6PN5kH2Dw@mail.gmail.com

2024-02-29  Use C99-designated initializer syntax for arrays related to encodings  (Michael Paquier)

This updates the following lookup arrays to use C99-designated initializer syntax, indexed based on the enum pg_enc:

pg_enc2icu_tbl[]
pg_enc2name_tbl[]
pg_wchar_table[]

This is more readable, and removes problems with ordering mistakes, as this removes dependencies between the arrays and their lookup index in the enum pg_enc. So, adding new encodings becomes easier, even if this does not happen often.

Author: Jelte Fennema-Nio
Reviewed-by: Jian He, Japin Li
Discussion: https://postgr.es/m/CAGECzQT3caUbcCcszNewCCmMbCuyP7XNAm60J3ybd6PN5kH2Dw@mail.gmail.com

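A sketch of the pattern behind these two commits (enum values and contents invented, not the real pg_enc entries): designated initializers pin each row to its enum value, so the array no longer depends on declaration order, and a range-checked direct lookup replaces a linear search.

    typedef enum sketch_enc
    {
        ENC_SQL_ASCII = 0,
        ENC_UTF8,
        ENC_LATIN1,
        ENC_LAST
    } sketch_enc;

    /* Each row is pinned to its enum value, whatever order we write them in. */
    static const char *const enc2name[] = {
        [ENC_SQL_ASCII] = "SQL_ASCII",
        [ENC_UTF8] = "UTF8",
        [ENC_LATIN1] = "LATIN1",
    };

    /* Direct lookup replaces a loop, provided the value is range-checked. */
    static const char *
    enc_name(sketch_enc enc)
    {
        return (enc >= ENC_SQL_ASCII && enc < ENC_LAST) ? enc2name[enc] : NULL;
    }
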
2024-01-29  Move is_valid_ascii() to ascii.h.  (Nathan Bossart)

This function requires simd.h, which is a rather large dependency for a widely-used header file like pg_wchar.h. Furthermore, there is a report of a third-party tool that is struggling to use pg_wchar.h due to its dependence on simd.h (presumably because simd.h uses several intrinsics). Moving the function to the much less popular ascii.h resolves these issues for now.

This commit is back-patched for the benefit of the aforementioned third-party tool. The simd.h dependency was only added in v16, but we've opted to back-patch to v15 so that is_valid_ascii() lives in the same file for all versions where it exists. This could break existing third-party code that uses the function, but we couldn't find any examples of such code. It should be possible to fix any code that this commit breaks by including ascii.h in the file that uses is_valid_ascii().

Author: Jubilee Young
Reviewed-by: Tom Lane, John Naylor, Andres Freund, Eric Ridge
Discussion: https://postgr.es/m/CAPNHn3oKJJxMsYq%2BqLYzVJOFrUcOr4OF1EC-KtFT-qh8nOOOtQ%40mail.gmail.com
Backpatch-through: 15

2024-01-04  Update copyright for 2024  (Bruce Momjian)

Reported-by: Michael Paquier
Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz
Backpatch-through: 12

2023-10-07  Restore proper linkage of pg_char_to_encoding() and friends.  (Tom Lane)

Back in the 8.3 era we discovered that it was problematic if libpq.so had encoding ID assignments different from the backend, which is possible because on some platforms libpq.so might be of a different major version from the calling programs. psql should use libpq's assignments, but initdb has to use the backend's, else it will put wrong values into pg_database.

The solution devised in commit 8468146b0 relied on giving initdb its own copy of encnames.c rather than relying on the functions exported by libpq. Later, that metamorphosed into ensuring that libpgcommon got linked before libpq -- which made things OK for initdb but broke psql. We didn't notice for lack of any changes in enum pg_enc since then. Commit 06843df4a reversed that, fixing the latent bug in psql but adding one in initdb. The meson build infrastructure is also not being sufficiently careful about link order, and trying to make it so would be equally fragile.

Hence, let's use a new scheme based on giving the libpq-exported symbols different real names than the same functions exported from libpgcommon.a or libpgcommon_srv.a. (We could distinguish those two cases as well, but there seems no need to.) libpq gets the official names to avoid an ABI break for libpq clients, while the other cases use #define's to make the real names "xxx_private" rather than "xxx". By controlling where the #define's are applied, we can force any particular client program to use one set or the other of the encnames.c functions.

We cannot back-patch this, since it'd be an ABI break for backend loadable modules, but there seems little need to. We're just trying to ensure that the world is safe for hypothetical future additions to enum pg_enc.

In passing this should fix "duplicate symbol" linker warnings that we've been seeing on AIX buildfarm members since commit 06843df4a. It's not very clear why that linker is complaining now, when there were strictly *more* duplicates visible before, but in any case this should remove the reason for complaint.

Patch by me; thanks to Andres Freund for review.

Discussion: https://postgr.es/m/2385119.1696354473@sss.pgh.pa.us

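A hedged sketch of the renaming trick (macro and guard names invented for illustration): programs built against libpgcommon see the #define, so both their references and libpgcommon's definition resolve to the "_private" symbol, while libpq exports the official name unchanged.

    #ifndef USING_LIBPQ_ENCNAMES
    #define pg_char_to_encoding pg_char_to_encoding_private
    #endif

    extern int pg_char_to_encoding(const char *name);
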
2023-01-02  Update copyright for 2023  (Bruce Momjian)

Backpatch-through: 11

2022-12-11  Convert json_in and jsonb_in to report errors softly.  (Tom Lane)

This requires a bit of further infrastructure-extension to allow trapping errors reported by numeric_in and pg_unicode_to_server, but otherwise it's pretty straightforward.

In the case of jsonb_in, we are only capturing errors reported during the initial "parse" phase. The value-construction phase (JsonbValueToJsonb) can also throw errors if assorted implementation limits are exceeded. We should improve that, but it seems like a separable project.

Andrew Dunstan and Tom Lane

Discussion: https://postgr.es/m/3bac9841-fe07-713d-fa42-606c225567d6@dunslane.net

2022-09-20  Harmonize parameter names in storage and AM code.  (Peter Geoghegan)

Make sure that function declarations use names that exactly match the corresponding names from function definitions in storage, catalog, access method, executor, and logical replication code, as well as in miscellaneous utility/library code.

Like other recent commits that cleaned up function parameter names, this commit was written with help from clang-tidy. Later commits will do the same for other parts of the codebase.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg@mail.gmail.com

2022-08-26  Use SSE2 in is_valid_ascii() where available.  (John Naylor)

Per flame graph from Jelte Fennema, COPY FROM ... USING BINARY shows input validation taking at least 5% of the profile, so it's worth trying to be more efficient here. With this change, validation of pure ASCII is nearly 40% faster on contemporary Intel hardware.

To make this change legible and easier to adapt to additional architectures, use helper functions to abstract the platform details away.

Reviewed by Nathan Bossart

Discussion: https://www.postgresql.org/message-id/CAFBsxsG%3Dk8t%3DC457FXnoBXb%3D8iA4OaZkbFogFMachWif7mNnww%40mail.gmail.com

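A hedged sketch of the SSE2 part for one 16-byte chunk (not the actual is_valid_ascii() coding): a byte is ASCII iff its high bit is clear, so the sign-bit mask gathered by _mm_movemask_epi8 must be zero.

    #include <emmintrin.h>  /* SSE2 */
    #include <stdbool.h>

    static bool
    chunk_is_ascii_sse2(const unsigned char *p)
    {
        __m128i     v = _mm_loadu_si128((const __m128i *) p);

        /* one sign bit per byte; all zero means all 16 bytes are < 0x80 */
        return _mm_movemask_epi8(v) == 0;
    }
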
2022-08-05  Simplify coding style of is_valid_ascii()  (John Naylor)

Calculate end of input rather than maintaining length, per prior suggestion from Heikki Linnakangas. In passing, use more natural language in a comment.

Discussion: https://www.postgresql.org/message-id/b4648cc2-5e9c-c93a-52cc-51e5c658a4f6%40iki.fi

2022-04-08  Apply PGDLLIMPORT markings broadly.  (Robert Haas)

Up until now, we've had a policy of only marking certain variables in the PostgreSQL header files with PGDLLIMPORT, but now we've decided to mark them all. This means that extensions running on Windows should no longer operate at a disadvantage as compared to extensions running on Linux: if the variable is present in a header file, it should be accessible.

Discussion: http://postgr.es/m/CA+TgmoYanc1_FSfimhgiWSqVyP5KKmh5NP2BWNwDhO8Pg2vGYQ@mail.gmail.com

2022-01-08  Update copyright for 2022  (Bruce Momjian)

Backpatch-through: 10

2021-12-20  Add fast path for validating UTF-8 text  (John Naylor)

Our previous validator used a traditional algorithm that performed comparison and branching one byte at a time. It's useful in that we always know exactly how many bytes we have validated, but that precision comes at a cost. Input validation can show up prominently in profiles of COPY FROM, and future improvements to COPY FROM such as parallelism or faster line parsing will put more pressure on input validation. Hence, add fast paths for both ASCII and multibyte UTF-8:

Use bitwise operations to check 16 bytes at a time for ASCII. If that fails, use a "shift-based" DFA on those bytes to handle the general case, including multibyte. These paths are relatively free of branches and thus robust against all kinds of byte patterns. With these algorithms, UTF-8 validation is several times faster, depending on platform and the input byte distribution.

The previous coding in pg_utf8_verifystr() is retained for short strings and for when the fast path returns an error.

Review, performance testing, and additional hacking by: Heikki Linnakangas, Vladimir Sitnikov, Amit Khandekar, Thomas Munro, and Greg Stark

Discussion: https://www.postgresql.org/message-id/CAFBsxsEV_SzH%2BOLyCiyon%3DiwggSyMh_eF6A3LU2tiWf3Cy2ZQg%40mail.gmail.com

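The ASCII fast path is easy to sketch in portable C (illustrative, not the committed code): OR two 8-byte words together; the chunk is all-ASCII iff no byte has its high bit set.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    static bool
    chunk16_is_ascii(const unsigned char *p)
    {
        uint64_t    a,
                    b;

        memcpy(&a, p, 8);       /* memcpy avoids alignment/aliasing issues */
        memcpy(&b, p + 8, 8);
        return ((a | b) & UINT64_C(0x8080808080808080)) == 0;
    }
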
2021-06-07  Fix incautious handling of possibly-miscoded strings in client code.  (Tom Lane)

An incorrectly-encoded multibyte character near the end of a string could cause various processing loops to run past the string's terminating NUL, with results ranging from no detectable issue to a program crash, depending on what happens to be in the following memory. This isn't an issue in the server, because we take care to verify the encoding of strings before doing any interesting processing on them. However, that lack of care leaked into client-side code which shouldn't assume that anyone has validated the encoding of its input.

Although this is certainly a bug worth fixing, the PG security team elected not to regard it as a security issue, primarily because any untrusted text should be sanitized by PQescapeLiteral or the like before being incorporated into a SQL or psql command. (If an app fails to do so, the same technique can be used to cause SQL injection, with probably much more dire consequences than a mere client-program crash.) Those functions were already made proof against this class of problem, cf CVE-2006-2313.

To fix, invent PQmblenBounded() which is like PQmblen() except it won't return more than the number of bytes remaining in the string. In HEAD we can make this a new libpq function, as PQmblen() is. It seems imprudent to change libpq's API in stable branches though, so in the back branches define PQmblenBounded as a macro in the files that need it. (Note that just changing PQmblen's behavior would not be a good idea; notably, it would completely break the escaping functions' defense against this exact problem. So we just want a version for those callers that don't have any better way of handling this issue.)

Per private report from houjingyi. Back-patch to all supported branches.

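One plausible shape for the bounded variant, consistent with the description above (a sketch, not necessarily the committed code):

    #include <libpq-fe.h>
    #include <string.h>

    /* PQmblen() reports the encoding's claimed width for the character at s;
     * strnlen() caps that at the terminating NUL, so a truncated multibyte
     * character at the end of a string can't carry a caller past it.
     */
    static int
    PQmblenBounded_sketch(const char *s, int encoding)
    {
        return (int) strnlen(s, (size_t) PQmblen(s, encoding));
    }
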
2021-04-01  Do COPY FROM encoding conversion/verification in larger chunks.  (Heikki Linnakangas)

This gives a small performance gain, by reducing the number of calls to the conversion/verification function, and letting it work with larger inputs. Also, reorganizing the input pipeline makes it easier to parallelize the input parsing: after the input has been converted to the database encoding, the next stage of finding the newlines can be done in parallel, because there cannot be any newline chars "embedded" in multi-byte characters in the encodings that we support as server encodings.

This changes behavior in one corner case: if client and server encodings are the same single-byte encoding (e.g. latin1), previously the input would not be checked for zero bytes ('\0'). Any fields containing zero bytes would be truncated at the zero. But if encoding conversion was needed, the conversion routine would throw an error on the zero. After this commit, the input is always checked for zeros.

Reviewed-by: John Naylor
Discussion: https://www.postgresql.org/message-id/e7861509-3960-538a-9025-b75a61188e01%40iki.fi

2021-04-01  Add 'noError' argument to encoding conversion functions.  (Heikki Linnakangas)

With the 'noError' argument, you can try to convert a buffer without knowing the character boundaries beforehand. The functions now need to return the number of input bytes successfully converted.

This is a backwards-incompatible change, if you have created a custom encoding conversion with CREATE CONVERSION. This adds a check to pg_upgrade for that, refusing the upgrade if there are any user-defined encoding conversions. Custom conversions are very rare; there are no commonly used extensions that I know of that use that feature. No other objects can depend on conversions, so if you do have one, you can fairly easily drop it before upgrading, and recreate it after the upgrade with an updated version.

Add regression tests for built-in encoding conversions. This doesn't cover every conversion, but it covers all the internal functions in conv.c that are used to implement the conversions.

Reviewed-by: John Naylor
Discussion: https://www.postgresql.org/message-id/e7861509-3960-538a-9025-b75a61188e01%40iki.fi

2021-01-28  Add mbverifystr() functions specific to each encoding.  (Heikki Linnakangas)

This makes the pg_verify_mbstr() function faster, by allowing more efficient encoding-specific implementations. All the implementations included in this commit are pretty naive: they just call the same encoding-specific verifychar functions that were used previously, but that already gives a performance boost because the tight character-at-a-time loop is simpler.

Reviewed-by: John Naylor
Discussion: https://www.postgresql.org/message-id/e7861509-3960-538a-9025-b75a61188e01@iki.fi

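A hedged sketch of that "naive" shape (names invented): an encoding-specific verifystr built by looping a per-character verifier. Here verifychar returns the character's length in bytes, or -1 on invalid input, and verifystr returns how many bytes verified cleanly.

    static int
    generic_verifystr(const unsigned char *s, int len,
                      int (*verifychar) (const unsigned char *s, int len))
    {
        const unsigned char *start = s;

        while (len > 0)
        {
            int         l = verifychar(s, len);

            if (l < 0)
                break;          /* caller sees how far we got */
            s += l;
            len -= l;
        }
        return (int) (s - start);
    }
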
2021-01-02  Update copyright for 2021  (Bruce Momjian)

Backpatch-through: 9.5

2020-03-06  Allow Unicode escapes in any server encoding, not only UTF-8.  (Tom Lane)

SQL includes provisions for numeric Unicode escapes in string literals and identifiers. Previously we only accepted those if they represented ASCII characters or the server encoding was UTF-8, making the conversion to internal form trivial. This patch adjusts things so that we'll call the appropriate encoding conversion function in less-trivial cases, allowing the escape sequence to be accepted so long as it corresponds to some character available in the server encoding.

This also applies to processing of Unicode escapes in JSONB. However, the old restriction still applies to client-side JSON processing, since that hasn't got access to the server's encoding conversion infrastructure.

This patch includes some lexer infrastructure that simplifies throwing errors with error cursors pointing into the middle of a string (or other complex token). For the moment I only used it for errors relating to Unicode escapes, but we might later expand the usage to some other cases.

Patch by me, reviewed by John Naylor.

Discussion: https://postgr.es/m/2393.1578958316@sss.pgh.pa.us

2020-01-16  Rationalize code placement between wchar.c, encnames.c, and mbutils.c.  (Tom Lane)

Move all the backend-only code that'd crept into wchar.c and encnames.c into mbutils.c.

To remove the last few #ifdef dependencies from wchar.c and encnames.c, also make the following changes:

* Adjust get_encoding_name_for_icu to return NULL, not throw an error, for unsupported encodings. Its sole caller can perfectly well throw an error instead. (While at it, I also made this function and its sibling is_encoding_supported_by_icu proof against out-of-range encoding IDs.)

* Remove the overlength-name error condition from pg_char_to_encoding. It's completely silly not to treat that just like any other the-name-is-not-in-the-table case.

Also, get rid of pg_mic_mblen --- there's no obvious reason why conv.c shouldn't call pg_mule_mblen instead.

Other than that, this is just code movement and comment-polishing with no functional changes. Notably, I reordered declarations in pg_wchar.h to show which functions are frontend-accessible and which are not.

Discussion: https://postgr.es/m/CA+TgmoYO8oq-iy8E02rD8eX25T-9SmyxKWqqks5OMHxKvGXpXQ@mail.gmail.com

2020-01-16  Move wchar.c and encnames.c to src/common/.  (Tom Lane)

Formerly, various frontend directories symlinked these two sources and then built them locally. That's an ancient, ugly hack, and we now have a much better way: put them into libpgcommon. So do that. (The immediate motivation for this is the prospect of having to introduce still more symlinking if we don't.)

This commit moves these two files absolutely verbatim, for ease of reviewing the git history. There's some follow-on work to be done that will modify them a bit.

Robert Haas, Tom Lane

Discussion: https://postgr.es/m/CA+TgmoYO8oq-iy8E02rD8eX25T-9SmyxKWqqks5OMHxKvGXpXQ@mail.gmail.com

2020-01-13  Reduce size of backend scanner's tables.  (Tom Lane)

Previously, the core scanner's yy_transition[] array had 37045 elements. Since that number is larger than INT16_MAX, Flex generated the array to contain 32-bit integers. By reimplementing some of the bulkier scanner rules, this patch reduces the array to 20495 elements. The much smaller total length, combined with the consequent use of 16-bit integers for the array elements, reduces the binary size by over 200kB.

This was accomplished in two ways:

1. Consolidate handling of quote continuations into a new start condition, rather than duplicating that logic for five different string types.

2. Treat Unicode strings and identifiers followed by a UESCAPE sequence as three separate tokens, rather than one. The logic to de-escape Unicode strings is moved to the filter code in parser.c, which already had the ability to provide special processing for token sequences. While we could have implemented the conversion in the grammar, that approach was rejected for performance and maintainability reasons.

Performance in microbenchmarks of raw parsing seems equal or slightly faster in most cases, and it's reasonable to expect that in real-world usage (with more competition for the CPU cache) there will be a larger win. The exception is UESCAPE sequences; lexing those is about 10% slower, primarily because the scanner now has to be called three times rather than one. This seems acceptable since that feature is very rarely used.

The psql and ecpg lexers are likewise modified, primarily because we want to keep them all in sync. Since those lexers don't use the space-hogging -CF option, the space savings is much less, but it's still good for perhaps 10kB apiece.

While at it, merge the ecpg lexer's handling of C-style comments used in SQL and in C. Those have different rules regarding nested comments, but since we already have the ability to keep track of the previous start condition, we can use that to handle both cases within a single start condition. This matches the core scanner more closely.

John Naylor

Discussion: https://postgr.es/m/CACPNZCvaoa3EgVWm5yZhcSTX6RAtaLgniCPcBVOCwm8h3xpWkw@mail.gmail.com

2020-01-01  Update copyrights for 2020  (Bruce Momjian)

Backpatch-through: update all files in master, backpatch legal files through 9.4

2019-12-10  Add backend-only appendStringInfoStringQuoted  (Alvaro Herrera)

This provides a mechanism to emit literal values in informative messages, such as query parameters. The new code is more complex than what it replaces, primarily because it wants to be more efficient. It also has the (currently unused) additional optional capability of specifying a maximum size to print.

The new function lives outside common/stringinfo.c so that frontend users of that file need not pull in unnecessary multibyte-encoding support code.

Author: Álvaro Herrera and Alexey Bashtanov, after a suggestion from Andres Freund
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/20190920203905.xkv5udsd5dxfs6tr@alap3.anarazel.de

2019-08-05  Fix inconsistencies and typos in the tree, take 9  (Michael Paquier)

This addresses more issues with code comments, variable names and unreferenced variables.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/7ab243e0-116d-3e44-d120-76b3df7abefd@gmail.com

2019-07-22  Fix inconsistencies and typos in the tree  (Michael Paquier)

This is numbered take 7, and addresses a set of issues with code comments, variable names and unreferenced variables.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/dff75442-2468-f74f-568c-6006e141062f@gmail.com

2019-07-05  Remove dead encoding-conversion functions.  (Tom Lane)

The code for conversions SQL_ASCII <-> MULE_INTERNAL and SQL_ASCII <-> UTF8 was unreachable, because we long ago changed the wrapper functions pg_do_encoding_conversion() et al so that they have hard-wired behaviors for conversions involving SQL_ASCII. (At least some of those fast paths date back to 2002, though it looks like we may not have been totally consistent about this until later.) Given the lack of complaints, nobody is dissatisfied with this state of affairs. Hence, let's just remove the unreachable code.

Also, change CREATE CONVERSION so that it rejects attempts to define such conversions. Since we consider that SQL_ASCII represents lack of knowledge about the encoding in use, such a conversion would be semantically dubious even if it were reachable.

Adjust a couple of regression test cases that had randomly decided to rely on these conversion functions rather than any other ones.

Discussion: https://postgr.es/m/41163.1559156593@sss.pgh.pa.us

2019-05-22  Phase 2 pgindent run for v12.  (Tom Lane)

Switch to 2.1 version of pg_bsd_indent. This formats multiline function declarations "correctly", that is with additional lines of parameter declarations indented to match where the first line's left parenthesis is.

Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com

2019-01-02  Update copyright for 2019  (Bruce Momjian)

Backpatch-through: certain files through 9.4

2018-04-01  Fix a boatload of typos in C comments.  (Tom Lane)

Justin Pryzby

Discussion: https://postgr.es/m/20180331105640.GK28454@telsasoft.com

2018-01-03  Update copyright for 2018  (Bruce Momjian)

Backpatch-through: certain files through 9.3

2017-10-11  Add more efficient functions to pqformat API.  (Andres Freund)

There are three prongs to achieve greater efficiency here:

1) Allow reusing a stringbuffer across pq_beginmessage/endmessage, with the new pq_beginmessage_reuse/endmessage_reuse. This can be beneficial both because it avoids allocating the initial buffer, and because it's more likely to already have a correctly sized buffer.

2) Replacing pq_sendint() with pq_sendint$width() inline functions. Previously unnecessary and unpredictable branches in pq_sendint() were needed. Additionally the replacement functions are implemented more efficiently. pq_sendint is now deprecated; a separate commit will convert all in-tree callers.

3) Add pq_writeint$width(), pq_writestring(). These rely on sufficient space in the StringInfo's buffer, avoiding individual space checks & potential individual resizing. To allow this to be used for strings, expose mbutil.c's MAX_CONVERSION_GROWTH.

Followup commits will make use of these facilities.

Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbekae4@alap3.anarazel.de

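The fixed-width idea behind pq_sendint$width() in miniature (illustrative, not the pqformat code): with the width known at compile time there is no branch on width, and the byte swap can inline.

    #include <arpa/inet.h>  /* POSIX htonl(), for the sketch */
    #include <stdint.h>
    #include <string.h>

    /* Write one 32-bit integer in network byte order, as the wire
     * protocol requires; the caller guarantees 4 bytes of space.
     */
    static inline void
    send_int32_sketch(unsigned char *cursor, uint32_t value)
    {
        uint32_t    n = htonl(value);

        memcpy(cursor, &n, sizeof(n));
    }
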
2017-06-21  Phase 3 of pgindent updates.  (Tom Lane)

Don't move parenthesized lines to the left, even if that means they flow past the right margin.

By default, BSD indent lines up statement continuation lines that are within parentheses so that they start just to the right of the preceding left parenthesis. However, traditionally, if that resulted in the continuation line extending to the right of the desired right margin, then indent would push it left just far enough to not overrun the margin, if it could do so without making the continuation line start to the left of the current statement indent. That makes for a weird mix of indentations unless one has been completely rigid about never violating the 80-column limit.

This behavior has been pretty universally panned by Postgres developers. Hence, disable it with indent's new -lpl switch, so that parenthesized lines are always lined up with the preceding left paren.

This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us

2017-06-21  Phase 2 of pgindent updates.  (Tom Lane)

Change pg_bsd_indent to follow upstream rules for placement of comments to the right of code, and remove pgindent hack that caused comments following #endif to not obey the general rule.

Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using the published version of pg_bsd_indent, but a hacked-up version that tried to minimize the amount of movement of comments to the right of code. The situation of interest is where such a comment has to be moved to the right of its default placement at column 33 because there's code there. BSD indent has always moved right in units of tab stops in such cases --- but in the previous incarnation, indent was working in 8-space tab stops, while now it knows we use 4-space tabs. So the net result is that in about half the cases, such comments are placed one tab stop left of before. This is better all around: it leaves more room on the line for comment text, and it means that in such cases the comment uniformly starts at the next 4-space tab stop after the code, rather than sometimes one and sometimes two tabs after.

Also, ensure that comments following #endif are indented the same as comments following other preprocessor commands such as #else. That inconsistency turns out to have been self-inflicted damage from a poorly-thought-through post-indent "fixup" in pgindent.

This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us

2017-06-21  Initial pgindent run with pg_bsd_indent version 2.0.  (Tom Lane)

The new indent version includes numerous fixes thanks to Piotr Stefaniak. The main changes visible in this commit are:

* Nicer formatting of function-pointer declarations.
* No longer unexpectedly removes spaces in expressions using casts, sizeof, or offsetof.
* No longer wants to add a space in "struct structname *varname", as well as some similar cases for const- or volatile-qualified pointers.
* Declarations using PG_USED_FOR_ASSERTS_ONLY are formatted more nicely.
* Fixes bug where comments following declarations were sometimes placed with no space separating them from the code.
* Fixes some odd decisions for comments following case labels.
* Fixes some cases where comments following code were indented to less than the expected column 33.

On the less good side, it now tends to put more whitespace around typedef names that are not listed in typedefs.list. This might encourage us to put more effort into typedef name collection; it's not really a bug in indent itself.

There are more changes coming after this round, having to do with comment indentation and alignment of lines appearing within parentheses. I wanted to limit the size of the diffs to something that could be reviewed without one's eyes completely glazing over, so it seemed better to split up the changes as much as practical.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us

2017-05-17  Post-PG 10 beta1 pgindent run  (Bruce Momjian)

perltidy run not included.

2017-03-23  ICU support  (Peter Eisentraut)

Add a column collprovider to pg_collation that determines which library provides the collation data. The existing choices are default and libc, and this adds an icu choice, which uses the ICU4C library.

The pg_locale_t type is changed to a union that contains the provider-specific locale handles. Users of locale information are changed to look into that struct for the appropriate handle to use.

Also add a collversion column that records the version of the collation when it is created, and check at run time whether it is still the same. This detects potentially incompatible library upgrades that can corrupt indexes and other structures. This is currently only supported by ICU-provided collations.

initdb initializes the default collation set as before from the `locale -a` output but also adds all available ICU locales with a "-x-icu" appended.

Currently, ICU-provided collations can only be explicitly named collations. The global database locales are still always libc-provided.

ICU support is enabled by configure --with-icu.

Reviewed-by: Thomas Munro <thomas.munro@enterprisedb.com>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>

2017-03-13  Use radix tree for character encoding conversions.  (Heikki Linnakangas)

Replace the mapping tables used to convert between UTF-8 and other character encodings with new radix tree-based maps. Looking up an entry in a radix tree is much faster than a binary search in the old maps. As a bonus, the radix tree representation is also more compact, making the binaries slightly smaller.

The "combined" maps work the same as before, with binary search. They are much smaller than the main tables, so it doesn't matter so much. However, the "combined" maps are now stored in the same .map files as the main tables. This seems more clear, since they're always used together, and generated from the same source files.

Patch by Kyotaro Horiguchi, with a lot of hacking by me at various stages. Reviewed by Michael Paquier and Daniel Gustafsson.

Discussion: https://www.postgresql.org/message-id/20170306.171609.204324917.horiguchi.kyotaro%40lab.ntt.co.jp

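The lookup-speed claim in miniature (a toy two-level radix lookup, not the generated .map structures): two array indexings replace a binary search.

    #include <stdint.h>

    /* Index by the high byte of a 16-bit code point to find a leaf page,
     * then by the low byte within it; NULL marks an absent page.
     */
    static inline uint16_t
    radix_lookup16(const uint16_t *const root[256], uint16_t cp)
    {
        const uint16_t *leaf = root[cp >> 8];

        return leaf ? leaf[cp & 0xFF] : 0;  /* 0 meaning "no mapping" here */
    }
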
2017-01-03  Update copyright via script for 2017  (Bruce Momjian)

2016-01-02  Update copyright for 2016  (Bruce Momjian)

Backpatch certain files through 9.1

2015-11-28  Avoid doing encoding conversions by double-conversion via MULE_INTERNAL.  (Tom Lane)

Previously, we did many conversions for Cyrillic and Central European single-byte encodings by converting to a related MULE_INTERNAL coding scheme before converting to the destination. This seems unnecessarily inefficient. Moreover, if the conversion encounters an untranslatable character, the error message will confusingly complain about failure to convert to or from MULE_INTERNAL, rather than the user-visible encodings. Worse still, this approach results in some completely unnecessary conversion failures; there are cases where the chosen MULE subset lacks characters that exist in both of the user-visible encodings, causing a conversion failure that need not occur.

This patch fixes the first two of those deficiencies by introducing a new local2local() conversion support subroutine for direct conversion between any two single-byte character sets, and adding new conversion tables where needed. However, I generated the new conversion tables by testing PG 9.5's behavior, so that the actual conversion behavior is bug-compatible with previous releases; the only user-visible behavior change is that the error messages for conversion failures are saner. Changes in the conversion behavior will probably ensue after discussion.

Interestingly, although this approach requires more tables, the .so files actually end up smaller (at least on my x86_64 machine); the tables are smaller than the management code needed for double conversion.

Per a complaint from Albe Laurenz.

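A hedged sketch of what a direct single-byte conversion subroutine can look like (names invented; the real local2local() also reports failures in terms of the user-visible encodings):

    /* Convert len bytes through a 256-entry table mapping source bytes to
     * destination bytes, with 0 marking "no translation" (assuming 0x00
     * itself never needs mapping).  Returns the number of bytes converted,
     * stopping early at an untranslatable byte so the caller can raise an
     * error naming the two user-visible encodings.
     */
    static int
    local2local_sketch(const unsigned char *src, unsigned char *dest, int len,
                       const unsigned char tab[256])
    {
        int         i;

        for (i = 0; i < len; i++)
        {
            unsigned char c = tab[src[i]];

            if (c == 0)
                break;
            dest[i] = c;
        }
        return i;
    }
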
2015-05-15  Teach UtfToLocal/LocalToUtf to support algorithmic encoding conversions.  (Tom Lane)

Until now, these functions have only supported encoding conversions using lookup tables, which is fine as long as there are not too many code points to convert. However, GB18030 expects all 1.1 million Unicode code points to be convertible, which would require a ridiculously-sized lookup table. Fortunately, a large fraction of those conversions can be expressed through arithmetic, ie the conversions are one-to-one in certain defined ranges. To support that, provide a callback function that is used after consulting the lookup tables. (This patch doesn't actually change anything about the GB18030 conversion behavior, just provides infrastructure for fixing it.)

Since this requires changing the APIs of UtfToLocal/LocalToUtf anyway, take the opportunity to rearrange their argument lists into what seems to me a saner order. And beautify the call sites by using lengthof() instead of error-prone sizeof() arithmetic.

In passing, also mark all the lookup tables used by these calls "const". This moves an impressive amount of stuff into the text segment, at least on my machine, and is safer anyhow.

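The callback idea in miniature (a toy, not the real UtfToLocal() signature): consult the lookup tables first, then fall back to arithmetic for ranges where the conversion is one-to-one.

    #include <stdint.h>

    /* Toy arithmetic fallback: if a whole range maps one-to-one with a
     * fixed offset, the conversion is a single add instead of one table
     * entry per code point.  The range and offset below are invented for
     * illustration, not GB18030's actual mapping.
     */
    static uint32_t
    algorithmic_fallback(uint32_t code)
    {
        if (code >= 0x10000 && code <= 0x10FFFF)
            return code + 0x1E248;  /* hypothetical linear segment */
        return 0;                   /* no algorithmic mapping; caller reports failure */
    }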