author Teodor Sigaev 2016-03-04 17:08:10 +0000
committer Teodor Sigaev 2016-03-04 17:08:47 +0000
commit d78a7d9c7fa3e9cd494b906f065fe7b7fe9fb9a5 (patch)
tree 23389711b4ccf0f5c8dd7684ae102b9eac5df66c /doc/src
parent 9445db925e78c2c4fb12067ad5618e2aecabe109 (diff)
Improve support of Hunspell in ispell dictionary.
Now it's possible to load recent version of Hunspell for several languages.
To handle these dictionaries Hunspell patch adds support for:
* FLAG long - sets the double extended ASCII character flag type
* FLAG num - sets the decimal number flag type (from 1 to 65535)
* AF parameter - alias for flag's set

Also it moves test dictionaries into separate directory.

Author: Artur Zakirov with editorization by me
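The three flag features named in the commit message can be illustrated with a small sketch. This is not the PostgreSQL implementation; it is a hypothetical Python helper (`split_flags` and its parameters are names invented here) that assumes the usual Hunspell conventions: `FLAG long` flags are two characters each, `FLAG num` flags are comma-separated decimal numbers, and with `AF` aliases the flag field of a dictionary entry is a 1-based index into the list of aliased flag sets.

```python
def split_flags(flag_field, flag_mode="default", aliases=None):
    """Split the flag field that follows '/' in a dictionary entry.

    flag_mode: "default" (one character per flag), "long" (two characters
    per flag) or "num" (comma-separated decimal numbers from 1 to 65535).
    aliases:   optional list of flag strings taken from AF lines; when
               given, flag_field is treated as a 1-based index into it.
    """
    if aliases is not None:
        # AF: the entry stores only an alias number, not the flags themselves.
        flag_field = aliases[int(flag_field) - 1]

    if flag_mode == "long":
        if len(flag_field) % 2 != 0:
            raise ValueError("FLAG long needs an even number of characters")
        return [flag_field[i:i + 2] for i in range(0, len(flag_field), 2)]

    if flag_mode == "num":
        flags = [int(token) for token in flag_field.split(",")]
        for flag in flags:
            if not 1 <= flag <= 65535:
                raise ValueError("FLAG num values must be in 1..65535")
        return flags

    # Default flag type: every single character is one flag.
    return list(flag_field)
```

For instance, an entry such as `large/PRTY` carries the four flags `P`, `R`, `T` and `Y` under the default flag type, whereas the same field would be read as the two flags `PR` and `TY` under `FLAG long`.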
Diffstat (limited to 'doc/src')
-rw-r--r-- doc/src/sgml/textsearch.sgml 148
1 file changed, 140 insertions, 8 deletions
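The documentation added by this patch describes Ispell suffix rules of the form `condition > [-stripping_letters,] adding_affix`. The following minimal sketch shows how such a rule reads, applied in the generative direction (base form to inflected form). It is an illustration only, not how PostgreSQL processes affix files: the helper names `parse_rule` and `apply_suffix` are hypothetical, and treating the condition as a Python regular expression is an assumption that holds only for the simple `.`, `[...]` and `[^...]` conditions shown in the patch.

```python
import re


def parse_rule(rule):
    """Split 'condition > [-strip,] add  # comment' into its three parts."""
    rule = rule.split("#", 1)[0]               # drop the trailing comment
    condition, action = (part.strip() for part in rule.split(">", 1))
    strip = ""
    if action.startswith("-"):                 # optional '-LETTERS,' prefix
        strip, action = action[1:].split(",", 1)
    return condition, strip.strip(), action.strip()


def apply_suffix(word, rule):
    """Return the inflected form, or None if the condition does not match."""
    condition, strip, add = parse_rule(rule)
    upper = word.upper()                       # rules are written in upper case
    # The condition must match at the end of the word.
    if not re.search("(?:" + condition + ")$", upper):
        return None
    if strip:
        if not upper.endswith(strip):
            return None
        word = word[:-len(strip)]              # remove the stripping letters
    return word + add.lower()


rules = [
    "E > ST                 # As in late > latest",
    "[^AEIOU]Y > -Y,IEST    # As in dirty > dirtiest",
    "[AEIOU]Y > EST         # As in gray > grayest",
    "[^EY] > EST            # As in small > smallest",
]

for base in ("late", "dirty", "gray", "small"):
    # Take the first rule of the affix class whose condition matches.
    forms = [apply_suffix(base, rule) for rule in rules]
    print(base, "->", next(form for form in forms if form is not None))
```

Prefix rules such as `. > RE` work the same way, except that the condition is checked at the beginning of the word and the affix is prepended; that case is left out to keep the sketch short.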
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index d66b4d5d5f..ff99976068 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -2615,18 +2615,41 @@ SELECT plainto_tsquery('supernova star');
</para>
<para>
- To create an <application>Ispell</> dictionary, use the built-in
- <literal>ispell</literal> template and specify several parameters:
+ To create an <application>Ispell</> dictionary perform these steps:
</para>
-
+ <itemizedlist spacing="compact" mark="bullet">
+ <listitem>
+ <para>
+ download dictionary configuration files. <productname>OpenOffice</>
+ extension files have the <filename>.oxt</> extension. Extract the
+ <filename>.aff</> and <filename>.dic</> files and change their
+ extensions to <filename>.affix</> and <filename>.dict</>. Some
+ dictionary files must also be converted to the UTF-8 encoding, for
+ example (for a Norwegian language dictionary):
<programlisting>
-CREATE TEXT SEARCH DICTIONARY english_ispell (
+iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
+iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
+</programlisting>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ copy files to the <filename>$SHAREDIR/tsearch_data</> directory
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ load files into PostgreSQL with the following command:
+<programlisting>
+CREATE TEXT SEARCH DICTIONARY english_hunspell (
TEMPLATE = ispell,
- DictFile = english,
- AffFile = english,
- StopWords = english
-);
+ DictFile = en_us,
+ AffFile = en_us,
+ StopWords = english);
</programlisting>
+ </para>
+ </listitem>
+ </itemizedlist>
<para>
Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
@@ -2643,6 +2666,56 @@ CREATE TEXT SEARCH DICTIONARY english_ispell (
</para>
<para>
+ The <filename>.affix</> file of <application>Ispell</> has the following
+ structure:
+<programlisting>
+prefixes
+flag *A:
+ . > RE # As in enter > reenter
+suffixes
+flag T:
+ E > ST # As in late > latest
+ [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest
+ [AEIOU]Y > EST # As in gray > grayest
+ [^EY] > EST # As in small > smallest
+</programlisting>
+ </para>
+ <para>
+ And the <filename>.dict</> file has the following structure:
+<programlisting>
+lapse/ADGRS
+lard/DGRS
+large/PRTY
+lark/MRS
+</programlisting>
+ </para>
+
+ <para>
+ The format of the <filename>.dict</> file is:
+<programlisting>
+basic_form/affix_class_name
+</programlisting>
+ </para>
+
+ <para>
+ In the <filename>.affix</> file every affix flag is described in the
+ following format:
+<programlisting>
+condition > [-stripping_letters,] adding_affix
+</programlisting>
+ </para>
+
+ <para>
+ Here, condition has a format similar to the format of regular expressions.
+ It can use groupings <literal>[...]</> and <literal>[^...]</>.
+ For example, <literal>[AEIOU]Y</> means that the last letter of the word
+ is <literal>"y"</> and the penultimate letter is <literal>"a"</>,
+ <literal>"e"</>, <literal>"i"</>, <literal>"o"</> or <literal>"u"</>.
+ <literal>[^EY]</> means that the last letter is neither <literal>"e"</>
+ nor <literal>"y"</>.
+ </para>
+
+ <para>
Ispell dictionaries support splitting compound words;
a useful feature.
Notice that the affix file should specify a special flag using the
@@ -2663,6 +2736,65 @@ SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
</programlisting>
</para>
+ <para>
+ <application>MySpell</> format is a subset of <application>Hunspell</>.
+ The <filename>.affix</> file of <application>Hunspell</> has the following
+ structure:
+<programlisting>
+PFX A Y 1
+PFX A 0 re .
+SFX T N 4
+SFX T 0 st e
+SFX T y iest [^aeiou]y
+SFX T 0 est [aeiou]y
+SFX T 0 est [^ey]
+</programlisting>
+ </para>
+
+ <para>
+ The first line of an affix class is the header. Fields of an affix rule are
+ listed after the header:
+ </para>
+ <itemizedlist spacing="compact" mark="bullet">
+ <listitem>
+ <para>
+ parameter name (PFX or SFX)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ flag (name of the affix class)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ characters to strip from the beginning (for a prefix) or the end (for a
+ suffix) of the word
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ adding affix
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ condition that has a format similar to the format of regular expressions.
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ <para>
+ The <filename>.dict</> file looks like the <filename>.dict</> file of
+ <application>Ispell</>:
+<programlisting>
+larder/M
+lardy/RT
+large/RSPMYT
+largehearted
+</programlisting>
+ </para>
+
<note>
<para>
<application>MySpell</> does not support compound words.