Improve support of Hunspell in ispell dictionary.

It is now possible to load recent versions of Hunspell dictionaries for several languages.
To handle these dictionaries, the patch adds support for:
* FLAG long - sets the double extended ASCII character flag type
* FLAG num - sets the decimal number flag type (from 1 to 65535)
* AF parameter - alias for a set of flags

It also moves the test dictionaries into a separate directory.

Author: Artur Zakirov, with editing by me
author: Teodor Sigaev 2016-03-04 17:08:10 +0000
committer: Teodor Sigaev 2016-03-04 17:08:47 +0000
commit: d78a7d9c7fa3e9cd494b906f065fe7b7fe9fb9a5 (patch)
tree: 23389711b4ccf0f5c8dd7684ae102b9eac5df66c /doc/src
parent: 9445db925e78c2c4fb12067ad5618e2aecabe109 (diff)
1 files changed, 140 insertions, 8 deletions
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index d66b4d5d5f..ff99976068 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -2615,18 +2615,41 @@ SELECT plainto_tsquery('supernova star');
    </para>
 
    <para>
-    To create an <application>Ispell</> dictionary, use the built-in
-    <literal>ispell</literal> template and specify several parameters:
+    To create an <application>Ispell</> dictionary, perform these steps:
    </para>
-
+   <itemizedlist spacing="compact" mark="bullet">
+    <listitem>
+     <para>
+      download dictionary configuration files. <productname>OpenOffice</>
+      extension files have the <filename>.oxt</> extension. It is necessary
+      to extract the <filename>.aff</> and <filename>.dic</> files and change
+      their extensions to <filename>.affix</> and <filename>.dict</>. Some
+      dictionary files must also be converted to the UTF-8 encoding, for
+      example, for the Norwegian language dictionary:
 <programlisting>
-CREATE TEXT SEARCH DICTIONARY english_ispell (
+iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
+iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
+</programlisting>
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      copy files to the <filename>$SHAREDIR/tsearch_data</> directory
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      load files into PostgreSQL with the following command:
+<programlisting>
+CREATE TEXT SEARCH DICTIONARY english_hunspell (
     TEMPLATE = ispell,
-    DictFile = english,
-    AffFile = english,
-    StopWords = english
-);
+    DictFile = en_us,
+    AffFile = en_us,
+    Stopwords = english);
 </programlisting>
+     </para>
+    </listitem>
+   </itemizedlist>
 
    <para>
     Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
@@ -2643,6 +2666,56 @@ CREATE TEXT SEARCH DICTIONARY english_ispell (
    </para>
 
    <para>
+    The <filename>.affix</> file of <application>Ispell</> has the following
+    structure:
+<programlisting>
+prefixes
+flag *A:
+    .           >   RE      # As in enter > reenter
+suffixes
+flag T:
+    E           >   ST      # As in late > latest
+    [^AEIOU]Y   >   -Y,IEST # As in dirty > dirtiest
+    [AEIOU]Y    >   EST     # As in gray > grayest
+    [^EY]       >   EST     # As in small > smallest
+</programlisting>
+   </para>
+   <para>
+    And the <filename>.dict</> file has the following structure:
+<programlisting>
+lapse/ADGRS
+lard/DGRS
+large/PRTY
+lark/MRS
+</programlisting>
+   </para>
+
+   <para>
+    The format of the <filename>.dict</> file is:
+<programlisting>
+basic_form/affix_class_name
+</programlisting>
+   </para>
+
+   <para>
+    In the <filename>.affix</> file every affix flag is described in the
+    following format:
+<programlisting>
+condition > [-stripping_letters,] adding_affix
+</programlisting>
+   </para>
+
+   <para>
+    Here, condition has a format similar to the format of regular expressions.
+    It can use groupings <literal>[...]</> and <literal>[^...]</>.
+    For example, <literal>[AEIOU]Y</> means that the last letter of the word
+    is <literal>"y"</> and the penultimate letter is <literal>"a"</>,
+    <literal>"e"</>, <literal>"i"</>, <literal>"o"</> or <literal>"u"</>.
+    <literal>[^EY]</> means that the last letter is neither <literal>"e"</>
+    nor <literal>"y"</>.
+   </para>
+
+   <para>
     Ispell dictionaries support splitting compound words;
     a useful feature.
     Notice that the affix file should specify a special flag using the
@@ -2663,6 +2736,65 @@ SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
 </programlisting>
    </para>
 
+   <para>
+    <application>MySpell</> format is a subset of <application>Hunspell</>.
+    The <filename>.affix</> file of <application>Hunspell</> has the following
+    structure:
+<programlisting>
+PFX A Y 1
+PFX A   0     re         .
+SFX T N 4
+SFX T   0     st         e
+SFX T   y     iest       [^aeiou]y
+SFX T   0     est        [aeiou]y
+SFX T   0     est        [^ey]
+</programlisting>
+   </para>
+
+   <para>
+    The first line of an affix class is the header. The fields of each affix rule
+    are listed after the header:
+   </para>
+   <itemizedlist spacing="compact" mark="bullet">
+    <listitem>
+     <para>
+      parameter name (PFX or SFX)
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      flag (name of the affix class)
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      characters to strip from the beginning (for a prefix) or end (for a suffix)
+      of the word
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      affix to add
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      condition that has a format similar to the format of regular expressions.
+     </para>
+    </listitem>
+   </itemizedlist>
+
+   <para>
+    The <filename>.dict</> file looks like the <filename>.dict</> file of
+    <application>Ispell</>:
+<programlisting>
+larder/M
+lardy/RT
+large/RSPMYT
+largehearted
+</programlisting>
+   </para>
+
    <note>
     <para>
      <application>MySpell</> does not support compound words.
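
To illustrate the Hunspell rule fields documented in the diff above (flag, characters to strip, affix to add, condition), here is a minimal Python sketch that parses the SFX T class from the example and applies it to a few base forms. It is not how PostgreSQL implements the ispell template; the function names parse_suffix_rules and apply_suffix are hypothetical, and the sketch assumes that a strip field of 0 means "strip nothing" and that the condition can be treated as a regular expression anchored at the end of the word.

```python
import re

# Suffix class copied from the Hunspell .affix example in the diff above.
AFFIX_TEXT = """\
SFX T N 4
SFX T   0     st         e
SFX T   y     iest       [^aeiou]y
SFX T   0     est        [aeiou]y
SFX T   0     est        [^ey]
"""


def parse_suffix_rules(text):
    """Return {flag: [(strip, add, condition), ...]} for SFX rule lines.

    Hypothetical helper: only five-field SFX rule lines are handled; the
    four-field class header (e.g. "SFX T N 4") is skipped.
    """
    rules = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[0] != "SFX":
            continue
        _, flag, strip, add, condition = fields
        # Assumption: "0" in the strip field means nothing is removed.
        strip = "" if strip == "0" else strip
        rules.setdefault(flag, []).append((strip, add, condition))
    return rules


def apply_suffix(word, flag, rules):
    """Return every form produced by the rules of `flag` whose condition matches."""
    forms = []
    for strip, add, condition in rules.get(flag, []):
        # Assumption: the condition is a regex matched against the end of the word.
        if re.search(condition + "$", word) and word.endswith(strip):
            stem = word[: len(word) - len(strip)]
            forms.append(stem + add)
    return forms


rules = parse_suffix_rules(AFFIX_TEXT)
for base in ("large", "dirty", "gray", "small"):
    print(base, "->", apply_suffix(base, "T", rules))
```

The sketch generates inflected forms from a base form, which is the easiest direction to read off the rule table. A text search dictionary works the other way round, normalizing an inflected word back to its base form.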