From 61d967498802ab86d8897cb3c61740d7e9d712f6 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 23 Sep 2008 09:20:39 +0000 Subject: Make LC_COLLATE and LC_CTYPE database-level settings. Collation and ctype are now more like encoding, stored in new datcollate and datctype columns in pg_database. This is a stripped-down version of Radek Strnad's patch, with further changes by me. --- doc/src/sgml/catalogs.sgml | 16 +++++++- doc/src/sgml/charset.sgml | 73 +++++++++++++++-------------------- doc/src/sgml/indices.sgml | 6 +-- doc/src/sgml/ref/create_database.sgml | 45 ++++++++++++++++++--- doc/src/sgml/ref/initdb.sgml | 41 ++++++++++++-------- doc/src/sgml/ref/pg_controldata.sgml | 4 +- doc/src/sgml/ref/pg_resetxlog.sgml | 14 +++---- doc/src/sgml/ref/select.sgml | 5 +-- doc/src/sgml/ref/show.sgml | 10 ++--- doc/src/sgml/runtime.sgml | 13 ++++--- doc/src/sgml/textsearch.sgml | 4 +- 11 files changed, 137 insertions(+), 94 deletions(-) (limited to 'doc/src') diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 40f1ce568e..bf1ac314f7 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -1,4 +1,4 @@ - + @@ -2149,6 +2149,20 @@ this number to the encoding name) + + datcollate + name + + LC_COLLATE for this database + + + + datctype + name + + LC_CTYPE for this database + + datistemplate bool diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml index 1f4866b203..c012294ef8 100644 --- a/doc/src/sgml/charset.sgml +++ b/doc/src/sgml/charset.sgml @@ -1,4 +1,4 @@ - + Localization</> @@ -130,23 +130,23 @@ initdb --locale=sv_SE <para> The nature of some locale categories is that their value has to be - fixed for the lifetime of a database cluster. That is, once - <command>initdb</command> has run, you cannot change them anymore. - <literal>LC_COLLATE</literal> and <literal>LC_CTYPE</literal> are - those categories. They affect the sort order of indexes, so they - must be kept fixed, or indexes on text columns will become corrupt. - <productname>PostgreSQL</productname> enforces this by recording - the values of <envar>LC_COLLATE</> and <envar>LC_CTYPE</> that are - seen by <command>initdb</>. The server automatically adopts - those two values when it is started. + fixed when the database is created. You can use different settings + for different databases, but once a database is created, you cannot + change them for that database anymore. <literal>LC_COLLATE</literal> + and <literal>LC_CTYPE</literal> are those categories. They affect + the sort order of indexes, so they must be kept fixed, or indexes on + text columns will become corrupt. The default values for these + categories are defined when <command>initdb</command> is run, and + those values are used when new databases are created, unless + specified otherwise in the <command>CREATE DATABASE</command> command. </para> <para> The other locale categories can be changed as desired whenever the server is running by setting the run-time configuration variables that have the same name as the locale categories (see <xref - linkend="runtime-config-client-format"> for details). The defaults that are - chosen by <command>initdb</command> are actually only written into + linkend="runtime-config-client-format"> for details). The defaults + that are chosen by <command>initdb</command> are actually only written into the configuration file <filename>postgresql.conf</filename> to serve as defaults when the server is started. If you delete these assignments from <filename>postgresql.conf</filename> then the @@ -261,7 +261,7 @@ initdb --locale=sv_SE <para> Check that <productname>PostgreSQL</> is actually using the locale - that you think it is. <envar>LC_COLLATE</> and <envar>LC_CTYPE</> + that you think it is. The default <envar>LC_COLLATE</> and <envar>LC_CTYPE</> settings are determined at <command>initdb</> time and cannot be changed without repeating <command>initdb</>. Other locale settings including <envar>LC_MESSAGES</> and <envar>LC_MONETARY</> @@ -319,17 +319,11 @@ initdb --locale=sv_SE </para> <para> - An important restriction, however, is that each database character set - must be compatible with the server's <envar>LC_CTYPE</> setting. + An important restriction, however, is that each database's character set + must be compatible with the database's <envar>LC_CTYPE</> setting. When <envar>LC_CTYPE</> is <literal>C</> or <literal>POSIX</>, any character set is allowed, but for other settings of <envar>LC_CTYPE</> there is only one character set that will work correctly. - Since the <envar>LC_CTYPE</> setting is frozen by <command>initdb</>, the - apparent flexibility to use different encodings in different databases - of a cluster is more theoretical than real, except when you select - <literal>C</> or <literal>POSIX</> locale (thus disabling any real locale - awareness). It is likely that these mechanisms will be revisited in future - versions of <productname>PostgreSQL</productname>. </para> <sect2 id="multibyte-charset-supported"> @@ -734,19 +728,19 @@ initdb -E EUC_JP </para> <para> - If you have selected <literal>C</> or <literal>POSIX</> locale, - you can create a database with a different character set: + You can specify a non-default encoding at database creation time, + provided that the encoding is compatible with the selected locale: <screen> -createdb -E EUC_KR korean +createdb -E EUC_KR -T template0 --lc-collate=ko_KR.euckr --lc-ctype=ko_KR.euckr korean </screen> This will create a database named <literal>korean</literal> that - uses the character set <literal>EUC_KR</literal>. Another way to - accomplish this is to use this SQL command: + uses the character set <literal>EUC_KR</literal>, and locale <literal>ko_KR</literal>. + Another way to accomplish this is to use this SQL command: <programlisting> -CREATE DATABASE korean WITH ENCODING 'EUC_KR'; +CREATE DATABASE korean WITH ENCODING 'EUC_KR' COLLATE='ko_KR.euckr' CTYPE='ko_KR.euckr' TEMPLATE=template0; </programlisting> The encoding for a database is stored in the system catalog @@ -756,20 +750,17 @@ CREATE DATABASE korean WITH ENCODING 'EUC_KR'; <screen> $ <userinput>psql -l</userinput> - List of databases - Database | Owner | Encoding ----------------+---------+--------------- - euc_cn | t-ishii | EUC_CN - euc_jp | t-ishii | EUC_JP - euc_kr | t-ishii | EUC_KR - euc_tw | t-ishii | EUC_TW - mule_internal | t-ishii | MULE_INTERNAL - postgres | t-ishii | EUC_JP - regression | t-ishii | SQL_ASCII - template1 | t-ishii | EUC_JP - test | t-ishii | EUC_JP - utf8 | t-ishii | UTF8 -(9 rows) + List of databases + Name | Owner | Encoding | Collation | Ctype | Access Privileges +-----------+----------+-----------+-------------+-------------+------------------------------------- + clocaledb | hlinnaka | SQL_ASCII | C | C | + englishdb | hlinnaka | UTF8 | en_GB.UTF8 | en_GB.UTF8 | + japanese | hlinnaka | UTF8 | ja_JP.UTF8 | ja_JP.UTF8 | + korean | hlinnaka | EUC_KR | ko_KR.euckr | ko_KR.euckr | + postgres | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 | + template0 | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 | {=c/hlinnaka,hlinnaka=CTc/hlinnaka} + template1 | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 | {=c/hlinnaka,hlinnaka=CTc/hlinnaka} +(7 rows) </screen> </para> diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml index 2ab713c39b..0993a8be03 100644 --- a/doc/src/sgml/indices.sgml +++ b/doc/src/sgml/indices.sgml @@ -1,4 +1,4 @@ -<!-- $PostgreSQL: pgsql/doc/src/sgml/indices.sgml,v 1.74 2008/07/11 21:06:28 tgl Exp $ --> +<!-- $PostgreSQL: pgsql/doc/src/sgml/indices.sgml,v 1.75 2008/09/23 09:20:34 heikki Exp $ --> <chapter id="indexes"> <title id="indexes-title">Indexes @@ -157,7 +157,7 @@ CREATE INDEX test1_id_index ON test1 (id); if the pattern is a constant and is anchored to the beginning of the string — for example, col LIKE 'foo%' or col ~ '^foo', but not - col LIKE '%bar'. However, if your server does not + col LIKE '%bar'. However, if your database does not use the C locale you will need to create the index with a special operator class to support indexing of pattern-matching queries. See below. It is also possible to use @@ -922,7 +922,7 @@ CREATE INDEX name ON table according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX - regular expressions) when the server does not use the standard + regular expressions) when the database does not use the standard C locale. As an example, you might index a varchar column like this: diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml index b1b1333245..5e72768981 100644 --- a/doc/src/sgml/ref/create_database.sgml +++ b/doc/src/sgml/ref/create_database.sgml @@ -1,5 +1,5 @@ @@ -24,6 +24,8 @@ CREATE DATABASE name [ [ WITH ] [ OWNER [=] dbowner ] [ TEMPLATE [=] template ] [ ENCODING [=] encoding ] + [ COLLATE [=] collate ] + [ CTYPE [=] ctype ] [ TABLESPACE [=] tablespace ] [ CONNECTION LIMIT [=] connlimit ] ] @@ -112,6 +114,29 @@ CREATE DATABASE name + + collate + + + Collation order (LC_COLLATE) to use in the new database. + This affects the sort order applied to strings, e.g in queries with + ORDER BY, as well as the order used in indexes on text columns. + The default is to use the collation order of the template database. + See below for additional restrictions. + + + + + ctype + + + Character classification (LC_CTYPE) to use in the new + database. This affects the categorization of characters, e.g. lower, + upper and digit. The default is to use the character classification of + the template database. See below for additional restrictions. + + + tablespace @@ -180,13 +205,11 @@ CREATE DATABASE name - Any character set encoding specified for the new database must be - compatible with the server's LC_CTYPE locale setting. + The character set encoding specified for the new database must be + compatible with the chosen COLLATE and CTYPE settings. If LC_CTYPE is C (or equivalently POSIX), then all encodings are allowed, but for other - locale settings there is only one encoding that will work properly, - and so the apparent freedom to specify an encoding is illusory if - you didn't initialize the database cluster in C locale. + locale settings there is only one encoding that will work properly. CREATE DATABASE will allow superusers to specify SQL_ASCII encoding regardless of the locale setting, but this choice is deprecated and may result in misbehavior of @@ -194,6 +217,16 @@ CREATE DATABASE name with the locale is stored in the database. + + The COLLATE and CTYPE settings must match + those of the template database, except when template0 is used as + template. This is because COLLATE and CTYPE + affects the ordering in indexes, so that any indexes copied from the + template database would be invalid in the new database with different + settings. template0, however, is known to not + contain any indexes that would be affected. + + The CONNECTION LIMIT option is only enforced approximately; if two new sessions start at about the same time when just one diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml index 312da7085a..110c21eb8c 100644 --- a/doc/src/sgml/ref/initdb.sgml +++ b/doc/src/sgml/ref/initdb.sgml @@ -1,5 +1,5 @@ @@ -76,25 +76,34 @@ PostgreSQL documentation initdb initializes the database cluster's default - locale and character set encoding. The collation order - (LC_COLLATE) and character set classes - (LC_CTYPE, e.g. upper, lower, digit) are fixed for all - databases and cannot be changed. Collation orders other than - C or POSIX also have a performance penalty. - For these reasons it is important to choose the right locale when - running initdb. The remaining locale categories - can be changed later when the server is started. All server locale - values (lc_*) can be displayed via SHOW ALL. + locale and character set encoding. The character set encoding, + collation order (LC_COLLATE) and character set classes + (LC_CTYPE, e.g. upper, lower, digit) can be set separately + for a database when it is created. initdb determines + those settings for the template1 database, which will + serve as the default for all other databases. + + + + To alter the default collation order or character set classes, use the + and options. + Collation orders other than C or POSIX also have + a performance penalty. For these reasons it is important to choose the + right locale when running initdb. + + + + The remaining locale categories can be changed later when the server + is started. You can also use to set the + default for all locale categories, including collation order and + character set classes. All server locale values (lc_*) can + be displayed via SHOW ALL. More details can be found in . - The character set encoding can be set separately for a database when - it is created. initdb determines the encoding for - the template1 database, which will serve as the - default for all other databases. To alter the default encoding use - the option. More details can be found in - . + To alter the default encoding, use the . + More details can be found in . diff --git a/doc/src/sgml/ref/pg_controldata.sgml b/doc/src/sgml/ref/pg_controldata.sgml index 466c03e224..62695963e2 100644 --- a/doc/src/sgml/ref/pg_controldata.sgml +++ b/doc/src/sgml/ref/pg_controldata.sgml @@ -1,5 +1,5 @@ @@ -30,7 +30,7 @@ PostgreSQL documentation Description pg_controldata prints information initialized during - initdb, such as the catalog version and server locale. + initdb, such as the catalog version. It also shows information about write-ahead logging and checkpoint processing. This information is cluster-wide, and not specific to any one database. diff --git a/doc/src/sgml/ref/pg_resetxlog.sgml b/doc/src/sgml/ref/pg_resetxlog.sgml index 588ff38c1b..a9d34298e4 100644 --- a/doc/src/sgml/ref/pg_resetxlog.sgml +++ b/doc/src/sgml/ref/pg_resetxlog.sgml @@ -1,5 +1,5 @@ @@ -62,14 +62,10 @@ PostgreSQL documentation by specifying the -f (force) switch. In this case plausible values will be substituted for the missing data. Most of the fields can be expected to match, but manual assistance might be needed for the next OID, - next transaction ID and epoch, next multitransaction ID and offset, - WAL starting address, and database locale fields. - The first six of these can be set using the switches discussed below. - pg_resetxlog's own environment is the source for its - guess at the locale fields; take care that LANG and so forth - match the environment that initdb was run in. - If you are not able to determine correct values for all these fields, - -f can still be used, but + next transaction ID and epoch, next multitransaction ID and offset, and + WAL starting address fields. These fields can be set using the switches + discussed below. If you are not able to determine correct values for all + these fields, -f can still be used, but the recovered database must be treated with even more suspicion than usual: an immediate dump and reload is imperative. Do not execute any data-modifying operations in the database before you dump, diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml index 000b5614dd..d8ed7aef9c 100644 --- a/doc/src/sgml/ref/select.sgml +++ b/doc/src/sgml/ref/select.sgml @@ -1,5 +1,5 @@ @@ -747,8 +747,7 @@ SELECT name FROM distributors ORDER BY code; Character-string data is sorted according to the locale-specific - collation order that was established when the database cluster - was initialized. + collation order that was established when the database was created. diff --git a/doc/src/sgml/ref/show.sgml b/doc/src/sgml/ref/show.sgml index ebd1acee35..fdc348053e 100644 --- a/doc/src/sgml/ref/show.sgml +++ b/doc/src/sgml/ref/show.sgml @@ -1,5 +1,5 @@ @@ -82,8 +82,8 @@ SHOW ALL Shows the database's locale setting for collation (text ordering). At present, this parameter can be shown but not - set, because the setting is determined at - initdb time. + set, because the setting is determined at database creation + time. @@ -94,8 +94,8 @@ SHOW ALL Shows the database's locale setting for character classification. At present, this parameter can be shown but - not set, because the setting is determined at - initdb time. + not set, because the setting is determined at database creation + time. diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml index 75c6d266e9..adde49e1a3 100644 --- a/doc/src/sgml/runtime.sgml +++ b/doc/src/sgml/runtime.sgml @@ -1,4 +1,4 @@ - + Operating System Environment @@ -145,11 +145,12 @@ postgres$ initdb -D /usr/local/pgsql/data Normally, it will just take the locale settings in the environment and apply them to the initialized database. It is possible to specify a different locale for the database; more information about - that can be found in . The sort order used - within a particular database cluster is set by - initdb and cannot be changed later, short of - dumping all data, rerunning initdb, and reloading - the data. There is also a performance impact for using locales + that can be found in . The default sort order used + within the particular database cluster is set by + initdb, and while you can create new databases using + different sort order, the order used in the template databases that initdb + creates cannot be changed without dropping and recreating them. + There is also a performance impact for using locales other than C or POSIX. Therefore, it is important to make this choice correctly the first time. diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml index 41db566b6c..45a9f5a389 100644 --- a/doc/src/sgml/textsearch.sgml +++ b/doc/src/sgml/textsearch.sgml @@ -1,4 +1,4 @@ - + Full Text Search @@ -1896,7 +1896,7 @@ LIMIT 10; - The parser's notion of a letter is determined by the server's + The parser's notion of a letter is determined by the database's locale setting, specifically lc_ctype. Words containing only the basic ASCII letters are reported as a separate token type, since it is sometimes useful to distinguish them. In most European -- cgit v1.2.3