<!--
-$Header: /cvsroot/pgsql/doc/src/sgml/func.sgml,v 1.136 2003/01/23 23:38:51 petere Exp $
+$Header: /cvsroot/pgsql/doc/src/sgml/func.sgml,v 1.137 2003/02/05 17:41:32 tgl Exp $
PostgreSQL documentation
-->
<row>
<entry> <literal>&</literal> </entry>
<entry>binary AND</entry>
- <entry>91 & 15</entry>
+ <entry>91 & 15</entry>
<entry>11</entry>
</row>
The <quote>binary</quote> operators are also available for the bit
string types <type>BIT</type> and <type>BIT VARYING</type>, as
shown in <xref linkend="functions-math-bit-table">.
- Bit string arguments to <literal>&</literal>, <literal>|</literal>,
+ Bit string arguments to <literal>&</literal>, <literal>|</literal>,
and <literal>#</literal> must be of equal length. When bit
shifting, the original length of the string is preserved, as shown
in the table.
<tbody>
<row>
- <entry>B'10001' & B'01101'</entry>
+ <entry>B'10001' & B'01101'</entry>
<entry>00001</entry>
</row>
<row>
one whose left parenthesis comes first) is
returned. You can always put parentheses around the whole expression
if you want to use parentheses within it without triggering this
- exception.
+ exception. Also see the non-capturing parentheses described below.
</para>
<para>
</programlisting>
</para>
-<!-- derived from the re_format.7 man page -->
+ <para>
+ <productname>PostgreSQL</productname>'s regular expressions are implemented
+ using a package written by Henry Spencer. Much of
+ the description of regular expressions below is copied verbatim from his
+ manual entry.
+ </para>
+
+<!-- derived from the re_syntax.n man page -->
+
+ <sect3 id="posix-syntax-details">
+ <title>Regular Expression Details</title>
+
<para>
Regular expressions (<acronym>RE</acronym>s), as defined in
- <acronym>POSIX</acronym>
- 1003.2, come in two forms: modern <acronym>RE</acronym>s (roughly those of
- <command>egrep</command>; 1003.2 calls these
- <quote>extended</quote> <acronym>RE</acronym>s) and obsolete <acronym>RE</acronym>s (roughly those of
- <command>ed</command>; 1003.2 <quote>basic</quote> <acronym>RE</acronym>s).
- <productname>PostgreSQL</productname> implements the modern form.
+ <acronym>POSIX</acronym> 1003.2, come in two forms:
+ <firstterm>extended</> <acronym>RE</acronym>s or <acronym>ERE</>s
+ (roughly those of <command>egrep</command>), and
+ <firstterm>basic</> <acronym>RE</acronym>s or <acronym>BRE</>s
+ (roughly those of <command>ed</command>).
+ <productname>PostgreSQL</productname> supports both forms, and
+ also implements some extensions
+ that are not in the POSIX standard, but have become widely used anyway
+ due to their availability in programming languages such as Perl and Tcl.
+ <acronym>RE</acronym>s using these non-POSIX extensions are called
+ <firstterm>advanced</> <acronym>RE</acronym>s or <acronym>ARE</>s
+ in this documentation. We first describe the ERE/ARE flavor and then
+ mention the restrictions of the BRE form.
</para>
<para>
- A (modern) RE is one or more non-empty
+ A regular expression is defined as one or more
<firstterm>branches</firstterm>, separated by
<literal>|</literal>. It matches anything that matches one of the
branches.
</para>
<para>
- A branch is one or more <firstterm>pieces</firstterm>,
- concatenated. It matches a match for the first, followed by a
- match for the second, etc.
+ A branch is zero or more <firstterm>quantified atoms</> or
+ <firstterm>constraints</>, concatenated.
+ It matches a match for the first, followed by a match for the second, etc;
+ an empty branch matches the empty string.
</para>
<para>
- A piece is an <firstterm>atom</firstterm> possibly followed by a
- single <literal>*</literal>, <literal>+</literal>,
- <literal>?</literal>, or <firstterm>bound</firstterm>. An atom
- followed by <literal>*</literal> matches a sequence of 0 or more
- matches of the atom. An atom followed by <literal>+</literal>
- matches a sequence of 1 or more matches of the atom. An atom
- followed by <literal>?</literal> matches a sequence of 0 or 1
- matches of the atom.
+ A quantified atom is an <firstterm>atom</> possibly followed
+ by a single <firstterm>quantifier</>.
+ Without a quantifier, it matches a match for the atom.
+ With a quantifier, it can match some number of matches of the atom.
+ An <firstterm>atom</firstterm> can be any of the possibilities
+ shown in <xref linkend="posix-atoms-table">.
+ The possible quantifiers and their meanings are shown in
+ <xref linkend="posix-quantifiers-table">.
</para>
<para>
- A <firstterm>bound</firstterm> is <literal>{</literal> followed by
- an unsigned decimal integer, possibly followed by
- <literal>,</literal> possibly followed by another unsigned decimal
- integer, always followed by <literal>}</literal>. The integers
- must lie between 0 and <symbol>RE_DUP_MAX</symbol> (255)
- inclusive, and if there are two of them, the first may not exceed
- the second. An atom followed by a bound containing one integer
- <replaceable>i</replaceable> and no comma matches a sequence of
- exactly <replaceable>i</replaceable> matches of the atom. An atom
- followed by a bound containing one integer
- <replaceable>i</replaceable> and a comma matches a sequence of
- <replaceable>i</replaceable> or more matches of the atom. An atom
- followed by a bound containing two integers
- <replaceable>i</replaceable> and <replaceable>j</replaceable>
- matches a sequence of <replaceable>i</replaceable> through
- <replaceable>j</replaceable> (inclusive) matches of the atom.
+ A <firstterm>constraint</> matches an empty string, but matches only when
+ specific conditions are met. A constraint can be used where an atom
+ could be used, except it may not be followed by a quantifier.
+ The simple constraints are shown in
+ <xref linkend="posix-constraints-table">;
+ some more constraints are described later.
+ </para>
+
+
+ <table id="posix-atoms-table">
+ <title>Regular Expression Atoms</title>
+
+ <tgroup cols="2">
+ <thead>
+ <row>
+ <entry>Atom</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry> <literal>(</><replaceable>re</><literal>)</> </entry>
+ <entry> (where <replaceable>re</> is any regular expression)
+ matches a match for
+ <replaceable>re</>, with the match noted for possible reporting </entry>
+ </row>
+
+ <row>
+ <entry> <literal>(?:</><replaceable>re</><literal>)</> </entry>
+ <entry> as above, but the match is not noted for reporting
+ (a <quote>non-capturing</> set of parentheses)
+ (AREs only) </entry>
+ </row>
+
+ <row>
+ <entry> <literal>.</> </entry>
+ <entry> matches any single character </entry>
+ </row>
+
+ <row>
+ <entry> <literal>[</><replaceable>chars</><literal>]</> </entry>
+ <entry> a <firstterm>bracket expression</>,
+ matching any one of the <replaceable>chars</> (see
+ <xref linkend="posix-bracket-expressions"> for more detail) </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\</><replaceable>k</> </entry>
+ <entry> (where <replaceable>k</> is a non-alphanumeric character)
+ matches that character taken as an ordinary character,
+ e.g. <literal>\\</> matches a backslash character </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\</><replaceable>c</> </entry>
+ <entry> where <replaceable>c</> is alphanumeric
+ (possibly followed by other characters)
+ is an <firstterm>escape</>, see <xref linkend="posix-escape-sequences">
+ (AREs only; in EREs and BREs, this matches <replaceable>c</>) </entry>
+ </row>
+
+ <row>
+ <entry> <literal>{</> </entry>
+ <entry> when followed by a character other than a digit,
+ matches the left-brace character <literal>{</>;
+ when followed by a digit, it is the beginning of a
+ <replaceable>bound</> (see below) </entry>
+ </row>
+
+ <row>
+ <entry> <replaceable>x</> </entry>
+ <entry> where <replaceable>x</> is a single character with no other
+ significance, matches that character </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ An RE may not end with <literal>\</>.
</para>
<note>
<para>
- A repetition operator (<literal>?</literal>,
- <literal>*</literal>, <literal>+</literal>, or bounds) cannot
- follow another repetition operator. A repetition operator cannot
+ Remember that the backslash (<literal>\</literal>) already has a special
+ meaning in <productname>PostgreSQL</> string literals.
+ To write a pattern constant that contains a backslash,
+ you must write two backslashes in the query.
+ </para>
+ </note>
+
+ <table id="posix-quantifiers-table">
+ <title>Regular Expression Quantifiers</title>
+
+ <tgroup cols="2">
+ <thead>
+ <row>
+ <entry>Quantifier</entry>
+ <entry>Matches</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry> <literal>*</> </entry>
+ <entry> a sequence of 0 or more matches of the atom </entry>
+ </row>
+
+ <row>
+ <entry> <literal>+</> </entry>
+ <entry> a sequence of 1 or more matches of the atom </entry>
+ </row>
+
+ <row>
+ <entry> <literal>?</> </entry>
+ <entry> a sequence of 0 or 1 matches of the atom </entry>
+ </row>
+
+ <row>
+ <entry> <literal>{</><replaceable>m</><literal>}</> </entry>
+ <entry> a sequence of exactly <replaceable>m</> matches of the atom </entry>
+ </row>
+
+ <row>
+ <entry> <literal>{</><replaceable>m</><literal>,}</> </entry>
+ <entry> a sequence of <replaceable>m</> or more matches of the atom </entry>
+ </row>
+
+ <row>
+ <entry>
+ <literal>{</><replaceable>m</><literal>,</><replaceable>n</><literal>}</> </entry>
+ <entry> a sequence of <replaceable>m</> through <replaceable>n</>
+ (inclusive) matches of the atom; <replaceable>m</> may not exceed
+ <replaceable>n</> </entry>
+ </row>
+
+ <row>
+ <entry> <literal>*?</> </entry>
+ <entry> non-greedy version of <literal>*</> </entry>
+ </row>
+
+ <row>
+ <entry> <literal>+?</> </entry>
+ <entry> non-greedy version of <literal>+</> </entry>
+ </row>
+
+ <row>
+ <entry> <literal>??</> </entry>
+ <entry> non-greedy version of <literal>?</> </entry>
+ </row>
+
+ <row>
+ <entry> <literal>{</><replaceable>m</><literal>}?</> </entry>
+ <entry> non-greedy version of <literal>{</><replaceable>m</><literal>}</> </entry>
+ </row>
+
+ <row>
+ <entry> <literal>{</><replaceable>m</><literal>,}?</> </entry>
+ <entry> non-greedy version of <literal>{</><replaceable>m</><literal>,}</> </entry>
+ </row>
+
+ <row>
+ <entry>
+ <literal>{</><replaceable>m</><literal>,</><replaceable>n</><literal>}?</> </entry>
+ <entry> non-greedy version of <literal>{</><replaceable>m</><literal>,</><replaceable>n</><literal>}</> </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ The forms using <literal>{</><replaceable>...</><literal>}</>
+ are known as <firstterm>bound</>s.
+ The numbers <replaceable>m</> and <replaceable>n</> within a bound are
+ unsigned decimal integers with permissible values from 0 to 255 inclusive.
+ </para>
+
+ <para>
+ <firstterm>Non-greedy</> quantifiers (available in AREs only) match the
+ same possibilities as their corresponding normal (<firstterm>greedy</>)
+ counterparts, but prefer the smallest number rather than the largest
+ number of matches.
+ See <xref linkend="posix-matching-rules"> for more detail.
+ </para>
+
+ <note>
+ <para>
+ A quantifier cannot immediately follow another quantifier.
+ A quantifier cannot
begin an expression or subexpression or follow
<literal>^</literal> or <literal>|</literal>.
</para>
</note>
- <para>
- An <firstterm>atom</firstterm> is a regular expression enclosed in
- <literal>()</literal> (matching a match for the regular
- expression), an empty set of <literal>()</literal> (matching the
- null string), a <firstterm>bracket expression</firstterm> (see
- below), <literal>.</literal> (matching any single character),
- <literal>^</literal> (matching the null string at the beginning of the
- input string), <literal>$</literal> (matching the null string at the end
- of the input string), a <literal>\</literal> followed by one of the
- characters <literal>^.[$()|*+?{\</literal> (matching that
- character taken as an ordinary character), a <literal>\</literal>
- followed by any other character (matching that character taken as
- an ordinary character, as if the <literal>\</literal> had not been
- present), or a single character with no other significance
- (matching that character). A <literal>{</literal> followed by a
- character other than a digit is an ordinary character, not the
- beginning of a bound. It is illegal to end an RE with
- <literal>\</literal>.
- </para>
+ <table id="posix-constraints-table">
+ <title>Regular Expression Constraints</title>
+
+ <tgroup cols="2">
+ <thead>
+ <row>
+ <entry>Constraint</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry> <literal>^</> </entry>
+ <entry> matches at the beginning of the string </entry>
+ </row>
+
+ <row>
+ <entry> <literal>$</> </entry>
+ <entry> matches at the end of the string </entry>
+ </row>
+
+ <row>
+ <entry> <literal>(?=</><replaceable>re</><literal>)</> </entry>
+ <entry> <firstterm>positive lookahead</> matches at any point
+ where a substring matching <replaceable>re</> begins
+ (AREs only) </entry>
+ </row>
+
+ <row>
+ <entry> <literal>(?!</><replaceable>re</><literal>)</> </entry>
+ <entry> <firstterm>negative lookahead</> matches at any point
+ where no substring matching <replaceable>re</> begins
+ (AREs only) </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
<para>
- Note that the backslash (<literal>\</literal>) already has a special
- meaning in string
- literals, so to write a pattern constant that contains a backslash
- you must write two backslashes in the query.
+ Lookahead constraints may not contain <firstterm>back references</>
+ (see <xref linkend="posix-escape-sequences">),
+ and all parentheses within them are considered non-capturing.
</para>
+ </sect3>
+
+ <sect3 id="posix-bracket-expressions">
+ <title>Bracket Expressions</title>
<para>
A <firstterm>bracket expression</firstterm> is a list of
characters enclosed in <literal>[]</literal>. It normally matches
any single character from the list (but see below). If the list
begins with <literal>^</literal>, it matches any single character
- (but see below) not from the rest of the list. If two characters
+ <emphasis>not</> from the rest of the list.
+ If two characters
in the list are separated by <literal>-</literal>, this is
shorthand for the full range of characters between those two
(inclusive) in the collating sequence,
e.g. <literal>[0-9]</literal> in <acronym>ASCII</acronym> matches
any decimal digit. It is illegal for two ranges to share an
endpoint, e.g. <literal>a-c-e</literal>. Ranges are very
- collating-sequence-dependent, and portable programs should avoid
+ collating-sequence-dependent, so portable programs should avoid
relying on them.
</para>
character, or the second endpoint of a range. To use a literal
<literal>-</literal> as the first endpoint of a range, enclose it
in <literal>[.</literal> and <literal>.]</literal> to make it a
- collating element (see below). With the exception of these and
- some combinations using <literal>[</literal> (see next
- paragraphs), all other special characters, including
- <literal>\</literal>, lose their special significance within a
- bracket expression.
+ collating element (see below). With the exception of these characters,
+ some combinations using <literal>[</literal>
+ (see next paragraphs), and escapes (AREs only), all other special
+ characters lose their special significance within a bracket expression.
+ In particular, <literal>\</literal> is not special when following
+ ERE or BRE rules, though it is special (as introducing an escape)
+ in AREs.
</para>
<para>
<literal>chchcc</literal>.
</para>
+ <note>
+ <para>
+ <productname>PostgreSQL</> currently has no multi-character collating
+ elements. This information describes possible future behavior.
+ </para>
+ </note>
+
<para>
Within a bracket expression, a collating element enclosed in
<literal>[=</literal> and <literal>=]</literal> is an equivalence
<para>
There are two special cases of bracket expressions: the bracket
expressions <literal>[[:<:]]</literal> and
- <literal>[[:>:]]</literal> match the null string at the beginning
+ <literal>[[:>:]]</literal> are constraints,
+ matching empty strings at the beginning
and end of a word respectively. A word is defined as a sequence
- of word characters which is neither preceded nor followed by word
- characters. A word character is an alnum character (as defined by
+ of word characters that is neither preceded nor followed by word
+ characters. A word character is an <literal>alnum</> character (as
+ defined by
<citerefentry><refentrytitle>ctype</refentrytitle><manvolnum>3</manvolnum></citerefentry>)
or an underscore. This is an extension, compatible with but not
- specified by <acronym>POSIX</acronym> 1003.2, and should be used with caution in
- software intended to be portable to other systems.
+ specified by <acronym>POSIX</acronym> 1003.2, and should be used with
+ caution in software intended to be portable to other systems.
+ The constraint escapes described below are usually preferable (they
+ are no more standard, but are certainly easier to type).
+ </para>
+ </sect3>
+
+ <sect3 id="posix-escape-sequences">
+ <title>Regular Expression Escapes</title>
+
+ <para>
+ <firstterm>Escapes</> are special sequences beginning with <literal>\</>
+ followed by an alphanumeric character. Escapes come in several varieties:
+ character entry, class shorthands, constraint escapes, and back references.
+ A <literal>\</> followed by an alphanumeric character but not constituting
+ a valid escape is illegal in AREs.
+ In EREs, there are no escapes: outside a bracket expression,
+ a <literal>\</> followed by an alphanumeric character merely stands for
+ that character as an ordinary character, and inside a bracket expression,
+ <literal>\</> is an ordinary character.
+ (The latter is the one actual incompatibility between EREs and AREs.)
+ </para>
+
+ <para>
+ <firstterm>Character-entry escapes</> exist to make it easier to specify
+ non-printing and otherwise inconvenient characters in REs. They are
+ shown in <xref linkend="posix-character-entry-escapes-table">.
+ </para>
+
+ <para>
+ <firstterm>Class-shorthand escapes</> provide shorthands for certain
+ commonly-used character classes. They are
+ shown in <xref linkend="posix-class-shorthand-escapes-table">.
+ </para>
+
+ <para>
+ A <firstterm>constraint escape</> is a constraint,
+ matching the empty string if specific conditions are met,
+ written as an escape. They are
+ shown in <xref linkend="posix-constraint-escapes-table">.
+ </para>
+
+ <para>
+ A <firstterm>back reference</> (<literal>\</><replaceable>n</>) matches the
+ same string matched by the previous parenthesized subexpression specified
+ by the number <replaceable>n</>
+ (see <xref linkend="posix-constraint-backref-table">). For example,
+ <literal>([bc])\1</> matches <literal>bb</> or <literal>cc</>
+ but not <literal>bc</> or <literal>cb</>.
+ The subexpression must entirely precede the back reference in the RE.
+ Subexpressions are numbered in the order of their leading parentheses.
+ Non-capturing parentheses do not define subexpressions.
+ </para>
+
+ <note>
+ <para>
+ Keep in mind that an escape's leading <literal>\</> will need to be
+ doubled when entering the pattern as an SQL string constant.
+ </para>
+ </note>
+
+ <table id="posix-character-entry-escapes-table">
+ <title>Regular Expression Character-Entry Escapes</title>
+
+ <tgroup cols="2">
+ <thead>
+ <row>
+ <entry>Escape</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry> <literal>\a</> </entry>
+ <entry> alert (bell) character, as in C </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\b</> </entry>
+ <entry> backspace, as in C </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\B</> </entry>
+ <entry> synonym for <literal>\</> to help reduce the need for backslash
+ doubling </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\c</><replaceable>X</> </entry>
+ <entry> (where <replaceable>X</> is any character) the character whose
+ low-order 5 bits are the same as those of
+ <replaceable>X</>, and whose other bits are all zero </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\e</> </entry>
+ <entry> the character whose collating-sequence name
+ is <literal>ESC</>,
+ or failing that, the character with octal value 033 </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\f</> </entry>
+ <entry> formfeed, as in C </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\n</> </entry>
+ <entry> newline, as in C </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\r</> </entry>
+ <entry> carriage return, as in C </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\t</> </entry>
+ <entry> horizontal tab, as in C </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\u</><replaceable>wxyz</> </entry>
+ <entry> (where <replaceable>wxyz</> is exactly four hexadecimal digits)
+ the Unicode character <literal>U+</><replaceable>wxyz</>
+ in the local byte ordering </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\U</><replaceable>stuvwxyz</> </entry>
+ <entry> (where <replaceable>stuvwxyz</> is exactly eight hexadecimal
+ digits)
+ reserved for a somewhat-hypothetical Unicode extension to 32 bits
+ </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\v</> </entry>
+ <entry> vertical tab, as in C </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\x</><replaceable>hhh</> </entry>
+ <entry> (where <replaceable>hhh</> is any sequence of hexadecimal
+ digits)
+ the character whose hexadecimal value is
+ <literal>0x</><replaceable>hhh</>
+ (a single character no matter how many hexadecimal digits are used)
+ </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\0</> </entry>
+ <entry> the character whose value is <literal>0</> </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\</><replaceable>xy</> </entry>
+ <entry> (where <replaceable>xy</> is exactly two octal digits,
+ and is not a <firstterm>back reference</>)
+ the character whose octal value is
+ <literal>0</><replaceable>xy</> </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\</><replaceable>xyz</> </entry>
+ <entry> (where <replaceable>xyz</> is exactly three octal digits,
+ and is not a <firstterm>back reference</>)
+ the character whose octal value is
+ <literal>0</><replaceable>xyz</> </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ Hexadecimal digits are <literal>0</>-<literal>9</>,
+ <literal>a</>-<literal>f</>, and <literal>A</>-<literal>F</>.
+ Octal digits are <literal>0</>-<literal>7</>.
+ </para>
+
+ <para>
+ The character-entry escapes are always taken as ordinary characters.
+ For example, <literal>\135</> is <literal>]</> in ASCII, but
+ <literal>\135</> does not terminate a bracket expression.
</para>
+ <table id="posix-class-shorthand-escapes-table">
+ <title>Regular Expression Class-Shorthand Escapes</title>
+
+ <tgroup cols="2">
+ <thead>
+ <row>
+ <entry>Escape</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry> <literal>\d</> </entry>
+ <entry> <literal>[[:digit:]]</> </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\s</> </entry>
+ <entry> <literal>[[:space:]]</> </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\w</> </entry>
+ <entry> <literal>[[:alnum:]_]</>
+ (note underscore is included) </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\D</> </entry>
+ <entry> <literal>[^[:digit:]]</> </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\S</> </entry>
+ <entry> <literal>[^[:space:]]</> </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\W</> </entry>
+ <entry> <literal>[^[:alnum:]_]</>
+ (note underscore is included) </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
<para>
- In the event that an RE could match more than one substring of a
- given string, the RE matches the one starting earliest in the
- string. If the RE could match more than one substring starting at
- that point, it matches the longest. Subexpressions also match the
- longest possible substrings, subject to the constraint that the
- whole match be as long as possible, with subexpressions starting
- earlier in the RE taking priority over ones starting later. Note
- that higher-level subexpressions thus take priority over their
- lower-level component subexpressions.
+ Within bracket expressions, <literal>\d</>, <literal>\s</>,
+ and <literal>\w</> lose their outer brackets,
+ and <literal>\D</>, <literal>\S</>, and <literal>\W</> are illegal.
+ (So, for example, <literal>[a-c\d]</> is equivalent to
+ <literal>[a-c[:digit:]]</>.
+ Also, <literal>[a-c\D]</>, which is equivalent to
+ <literal>[a-c^[:digit:]]</>, is illegal.)
</para>
+ <table id="posix-constraint-escapes-table">
+ <title>Regular Expression Constraint Escapes</title>
+
+ <tgroup cols="2">
+ <thead>
+ <row>
+ <entry>Escape</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry> <literal>\A</> </entry>
+ <entry> matches only at the beginning of the string
+ (see <xref linkend="posix-matching-rules"> for how this differs from
+ <literal>^</>) </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\m</> </entry>
+ <entry> matches only at the beginning of a word </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\M</> </entry>
+ <entry> matches only at the end of a word </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\y</> </entry>
+ <entry> matches only at the beginning or end of a word </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\Y</> </entry>
+ <entry> matches only at a point that is not the beginning or end of a
+ word </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\Z</> </entry>
+ <entry> matches only at the end of the string
+ (see <xref linkend="posix-matching-rules"> for how this differs from
+ <literal>$</>) </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ A word is defined as in the specification of
+ <literal>[[:<:]]</> and <literal>[[:>:]]</> above.
+ Constraint escapes are illegal within bracket expressions.
+ </para>
+
+ <table id="posix-constraint-backref-table">
+ <title>Regular Expression Back References</title>
+
+ <tgroup cols="2">
+ <thead>
+ <row>
+ <entry>Escape</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry> <literal>\</><replaceable>m</> </entry>
+ <entry> (where <replaceable>m</> is a nonzero digit)
+ a back reference to the <replaceable>m</>'th subexpression </entry>
+ </row>
+
+ <row>
+ <entry> <literal>\</><replaceable>mnn</> </entry>
+ <entry> (where <replaceable>m</> is a nonzero digit, and
+ <replaceable>nn</> is some more digits, and the decimal value
+ <replaceable>mnn</> is not greater than the number of closing capturing
+ parentheses seen so far)
+ a back reference to the <replaceable>mnn</>'th subexpression </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <note>
+ <para>
+ There is an inherent historical ambiguity between octal character-entry
+ escapes and back references, which is resolved by heuristics,
+ as hinted at above.
+ A leading zero always indicates an octal escape.
+ A single non-zero digit, not followed by another digit,
+ is always taken as a back reference.
+ A multi-digit sequence not starting with a zero is taken as a back
+ reference if it comes after a suitable subexpression
+ (i.e. the number is in the legal range for a back reference),
+ and otherwise is taken as octal.
+ </para>
+ </note>
+ </sect3>
+
+ <sect3 id="posix-metasyntax">
+ <title>Regular Expression Metasyntax</title>
+
<para>
- Match lengths are measured in characters, not collating
- elements. A null string is considered longer than no match at
- all. For example, <literal>bb*</literal> matches the three middle
- characters of <literal>abbbc</literal>,
- <literal>(wee|week)(knights|nights)</literal> matches all ten
- characters of <literal>weeknights</literal>, when
- <literal>(.*).*</literal> is matched against
- <literal>abc</literal> the parenthesized subexpression matches all
- three characters, and when <literal>(a*)*</literal> is matched
- against <literal>bc</literal> both the whole RE and the
- parenthesized subexpression match the null string.
+ In addition to the main syntax described above, there are some special
+ forms and miscellaneous syntactic facilities available.
</para>
<para>
- If case-independent matching is specified, the effect is much as
- if all case distinctions had vanished from the alphabet. When an
- alphabetic that exists in multiple cases appears as an ordinary
- character outside a bracket expression, it is effectively
+ Normally the flavor of RE being used is specified by
+ application-dependent means.
+ However, this can be overridden by a <firstterm>director</>.
+ If an RE of any flavor begins with <literal>***:</>,
+ the rest of the RE is an ARE.
+ If an RE of any flavor begins with <literal>***=</>,
+ the rest of the RE is taken to be a literal string,
+ with all characters considered ordinary characters.
+ </para>
+
+ <para>
+ An ARE may begin with <firstterm>embedded options</>:
+ a sequence <literal>(?</><replaceable>xyz</><literal>)</>
+ (where <replaceable>xyz</> is one or more alphabetic characters)
+ specifies options affecting the rest of the RE.
+ These supplement, and can override,
+ any options specified externally.
+ The available option letters are
+ shown in <xref linkend="posix-embedded-options-table">.
+ </para>
+
+ <table id="posix-embedded-options-table">
+ <title>ARE Embedded-Option Letters</title>
+
+ <tgroup cols="2">
+ <thead>
+ <row>
+ <entry>Option</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry> <literal>b</> </entry>
+ <entry> rest of RE is a BRE </entry>
+ </row>
+
+ <row>
+ <entry> <literal>c</> </entry>
+ <entry> case-sensitive matching (usual default) </entry>
+ </row>
+
+ <row>
+ <entry> <literal>e</> </entry>
+ <entry> rest of RE is an ERE </entry>
+ </row>
+
+ <row>
+ <entry> <literal>i</> </entry>
+ <entry> case-insensitive matching (see
+ <xref linkend="posix-matching-rules">) </entry>
+ </row>
+
+ <row>
+ <entry> <literal>m</> </entry>
+ <entry> historical synonym for <literal>n</> </entry>
+ </row>
+
+ <row>
+ <entry> <literal>n</> </entry>
+ <entry> newline-sensitive matching (see
+ <xref linkend="posix-matching-rules">) </entry>
+ </row>
+
+ <row>
+ <entry> <literal>p</> </entry>
+ <entry> partial newline-sensitive matching (see
+ <xref linkend="posix-matching-rules">) </entry>
+ </row>
+
+ <row>
+ <entry> <literal>q</> </entry>
+ <entry> rest of RE is a literal (<quote>quoted</>) string, all ordinary
+ characters </entry>
+ </row>
+
+ <row>
+ <entry> <literal>s</> </entry>
+ <entry> non-newline-sensitive matching (usual default) </entry>
+ </row>
+
+ <row>
+ <entry> <literal>t</> </entry>
+ <entry> tight syntax (usual default; see below) </entry>
+ </row>
+
+ <row>
+ <entry> <literal>w</> </entry>
+ <entry> inverse partial newline-sensitive (<quote>weird</>) matching
+ (see <xref linkend="posix-matching-rules">) </entry>
+ </row>
+
+ <row>
+ <entry> <literal>x</> </entry>
+ <entry> expanded syntax (see below) </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ Embedded options take effect at the <literal>)</> terminating the sequence.
+ They are available only at the start of an ARE,
+ and may not be used later within it.
+ </para>
+
+ <para>
+ In addition to the usual (<firstterm>tight</>) RE syntax, in which all
+ characters are significant, there is an <firstterm>expanded</> syntax,
+ available by specifying the embedded <literal>x</> option.
+ In the expanded syntax,
+ white-space characters in the RE are ignored, as are
+ all characters between a <literal>#</>
+ and the following newline (or the end of the RE). This
+ permits paragraphing and commenting a complex RE.
+ There are three exceptions to that basic rule:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ a white-space character or <literal>#</> preceded by <literal>\</> is
+ retained
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ white space or <literal>#</> within a bracket expression is retained
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ white space and comments are illegal within multi-character symbols,
+ like the ARE <literal>(?:</> or the BRE <literal>\(</>
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ Expanded-syntax white-space characters are blank, tab, newline, and
+ any character that belongs to the <replaceable>space</> character class.
+ </para>
+
+ <para>
+ Finally, in an ARE, outside bracket expressions, the sequence
+ <literal>(?#</><replaceable>ttt</><literal>)</>
+ (where <replaceable>ttt</> is any text not containing a <literal>)</>)
+ is a comment, completely ignored.
+ Again, this is not allowed between the characters of
+ multi-character symbols, like <literal>(?:</>.
+ Such comments are more a historical artifact than a useful facility,
+ and their use is deprecated; use the expanded syntax instead.
+ </para>
+
+ <para>
+ <emphasis>None</> of these metasyntax extensions is available if
+ an initial <literal>***=</> director
+ has specified that the user's input be treated as a literal string
+ rather than as an RE.
+ </para>
+ </sect3>
+
+ <sect3 id="posix-matching-rules">
+ <title>Regular Expression Matching Rules</title>
+
+ <para>
+ In the event that an RE could match more than one substring of a given
+ string, the RE matches the one starting earliest in the string.
+ If the RE could match more than one substring starting at that point,
+ its choice is determined by its <firstterm>preference</>:
+ either the longest substring, or the shortest.
+ </para>
+
+ <para>
+ Most atoms, and all constraints, have no preference.
+ A parenthesized RE has the same preference (possibly none) as the RE.
+ A quantified atom with quantifier
+ <literal>{</><replaceable>m</><literal>}</>
+ or
+ <literal>{</><replaceable>m</><literal>}?</>
+ has the same preference (possibly none) as the atom itself.
+ A quantified atom with other normal quantifiers (including
+ <literal>{</><replaceable>m</><literal>,</><replaceable>n</><literal>}</>
+ with <replaceable>m</> equal to <replaceable>n</>)
+ prefers longest match.
+ A quantified atom with other non-greedy quantifiers (including
+ <literal>{</><replaceable>m</><literal>,</><replaceable>n</><literal>}?</>
+ with <replaceable>m</> equal to <replaceable>n</>)
+ prefers shortest match.
+ A branch has the same preference as the first quantified atom in it
+ which has a preference.
+ An RE consisting of two or more branches connected by the
+ <literal>|</> operator prefers longest match.
+ </para>
+
+ <para>
+ Subject to the constraints imposed by the rules for matching the whole RE,
+ subexpressions also match the longest or shortest possible substrings,
+ based on their preferences,
+ with subexpressions starting earlier in the RE taking priority over
+ ones starting later.
+ Note that outer subexpressions thus take priority over
+ their component subexpressions.
+ </para>
+
+ <para>
+ The quantifiers <literal>{1,1}</> and <literal>{1,1}?</>
+ can be used to force longest and shortest preference, respectively,
+ on a subexpression or a whole RE.
+ </para>
+
+ <para>
+ Match lengths are measured in characters, not collating elements.
+ An empty string is considered longer than no match at all.
+ For example:
+ <literal>bb*</>
+ matches the three middle characters of <literal>abbbc</>;
+ <literal>(week|wee)(night|knights)</>
+ matches all ten characters of <literal>weeknights</>;
+ when <literal>(.*).*</>
+ is matched against <literal>abc</> the parenthesized subexpression
+ matches all three characters; and when
+ <literal>(a*)*</> is matched against <literal>bc</>
+ both the whole RE and the parenthesized
+ subexpression match an empty string.
+ </para>
+
+ <para>
+ If case-independent matching is specified,
+ the effect is much as if all case distinctions had vanished from the
+ alphabet.
+ When an alphabetic that exists in multiple cases appears as an
+ ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
- e.g. <literal>x</literal> becomes <literal>[xX]</literal>. When
- it appears inside a bracket expression, all case counterparts of
- it are added to the bracket expression, so that (e.g.)
- <literal>[x]</literal> becomes <literal>[xX]</literal> and
- <literal>[^x]</literal> becomes <literal>[^xX]</literal>.
+ e.g. <literal>x</> becomes <literal>[xX]</>.
+ When it appears inside a bracket expression, all case counterparts
+ of it are added to the bracket expression, e.g.
+ <literal>[x]</> becomes <literal>[xX]</>
+ and <literal>[^x]</> becomes <literal>[^xX]</>.
</para>
<para>
- There is no particular limit on the length of <acronym>RE</acronym>s, except insofar
- as memory is limited. Memory usage is approximately linear in RE
- size, and largely insensitive to RE complexity, except for bounded
- repetitions. Bounded repetitions are implemented by macro
- expansion, which is costly in time and space if counts are large
- or bounded repetitions are nested. An RE like, say,
- <literal>((((a{1,100}){1,100}){1,100}){1,100}){1,100}</literal>
- will (eventually) run almost any existing machine out of swap
- space.
- <footnote>
- <para>
- This was written in 1994, mind you. The
- numbers have probably changed, but the problem
- persists.
- </para>
- </footnote>
+ If newline-sensitive matching is specified, <literal>.</>
+ and bracket expressions using <literal>^</>
+ will never match the newline character
+ (so that matches will never cross newlines unless the RE
+ explicitly arranges it)
+ and <literal>^</>and <literal>$</>
+ will match the empty string after and before a newline
+ respectively, in addition to matching at beginning and end of string
+ respectively.
+ But the ARE escapes <literal>\A</> and <literal>\Z</>
+ continue to match beginning or end of string <emphasis>only</>.
</para>
-<!-- end re_format.7 man page -->
- </sect2>
+ <para>
+ If partial newline-sensitive matching is specified,
+ this affects <literal>.</> and bracket expressions
+ as with newline-sensitive matching, but not <literal>^</>
+ and <literal>$</>.
+ </para>
+
+ <para>
+ If inverse partial newline-sensitive matching is specified,
+ this affects <literal>^</> and <literal>$</>
+ as with newline-sensitive matching, but not <literal>.</>
+ and bracket expressions.
+ This isn't very useful but is provided for symmetry.
+ </para>
+ </sect3>
+
+ <sect3 id="posix-limits-compatibility">
+ <title>Limits and Compatibility</title>
+
+ <para>
+ No particular limit is imposed on the length of REs in this
+ implementation. However,
+ programs intended to be highly portable should not employ REs longer
+ than 256 bytes,
+ as a POSIX-compliant implementation can refuse to accept such REs.
+ </para>
+
+ <para>
+ The only feature of AREs that is actually incompatible with
+ POSIX EREs is that <literal>\</> does not lose its special
+ significance inside bracket expressions.
+ All other ARE features use syntax which is illegal or has
+ undefined or unspecified effects in POSIX EREs;
+ the <literal>***</> syntax of directors likewise is outside the POSIX
+ syntax for both BREs and EREs.
+ </para>
+
+ <para>
+ Many of the ARE extensions are borrowed from Perl, but some have
+ been changed to clean them up, and a few Perl extensions are not present.
+ Incompatibilities of note include <literal>\b</>, <literal>\B</>,
+ the lack of special treatment for a trailing newline,
+ the addition of complemented bracket expressions to the things
+ affected by newline-sensitive matching,
+ the restrictions on parentheses and back references in lookahead
+ constraints, and the longest/shortest-match (rather than first-match)
+ matching semantics.
+ </para>
+
+ <para>
+ Two significant incompatibilites exist between AREs and the ERE syntax
+ recognized by pre-7.4 releases of <productname>PostgreSQL</>:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ In AREs, <literal>\</> followed by an alphanumeric character is either
+ an escape or an error, while in previous releases, it was just another
+ way of writing the alphanumeric.
+ This should not be much of a problem because there was no reason to
+ write such a sequence in earlier releases.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ In AREs, <literal>\</> remains a special character within
+ <literal>[]</>, so a literal <literal>\</> within a bracket
+ expression must be written <literal>\\</>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </sect3>
+
+ <sect3 id="posix-basic-regexes">
+ <title>Basic Regular Expressions</title>
+
+ <para>
+ BREs differ from EREs in several respects.
+ <literal>|</>, <literal>+</>, and <literal>?</>
+ are ordinary characters and there is no equivalent
+ for their functionality.
+ The delimiters for bounds are
+ <literal>\{</> and <literal>\}</>,
+ with <literal>{</> and <literal>}</>
+ by themselves ordinary characters.
+ The parentheses for nested subexpressions are
+ <literal>\(</> and <literal>\)</>,
+ with <literal>(</> and <literal>)</> by themselves ordinary characters.
+ <literal>^</> is an ordinary character except at the beginning of the
+ RE or the beginning of a parenthesized subexpression,
+ <literal>$</> is an ordinary character except at the end of the
+ RE or the end of a parenthesized subexpression,
+ and <literal>*</> is an ordinary character if it appears at the beginning
+ of the RE or the beginning of a parenthesized subexpression
+ (after a possible leading <literal>^</>).
+ Finally, single-digit back references are available, and
+ <literal>\<</> and <literal>\></>
+ are synonyms for
+ <literal>[[:<:]]</> and <literal>[[:>:]]</>
+ respectively; no other escapes are available.
+ </para>
+ </sect3>
+
+<!-- end re_syntax.n man page -->
+
+ </sect2>
</sect1>
<!--
-$Header: /cvsroot/pgsql/doc/src/sgml/release.sgml,v 1.184 2003/02/02 23:46:38 tgl Exp $
+$Header: /cvsroot/pgsql/doc/src/sgml/release.sgml,v 1.185 2003/02/05 17:41:32 tgl Exp $
-->
<appendix id="release">
worries about funny characters.
-->
<literallayout><![CDATA[
+New regular expression package, many more regexp features (most of Perl5)
Can now do EXPLAIN ... EXECUTE to see plan used for a prepared query
Explicit JOINs no longer constrain query plan, unless JOIN_COLLAPSE_LIMIT = 1
Performance of "foo IN (SELECT ...)" queries has been considerably improved
-Copyright 1992, 1993, 1994 Henry Spencer. All rights reserved.
-This software is not subject to any license of the American Telephone
-and Telegraph Company or of the Regents of the University of California.
-
-Permission is granted to anyone to use this software for any purpose on
-any computer system, and to alter it and redistribute it, subject
-to the following restrictions:
-
-1. The author is not responsible for the consequences of use of this
- software, no matter how awful, even if they arise from flaws in it.
-
-2. The origin of this software must not be misrepresented, either by
- explicit claim or by omission. Since few users ever read sources,
- credits must appear in the documentation.
-
-3. Altered versions must be plainly marked as such, and must not be
- misrepresented as being the original software. Since few users
- ever read sources, credits must appear in the documentation.
-
-4. This notice may not be removed or altered.
-
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
-/*-
- * Copyright (c) 1994
- * The Regents of the University of California. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- * 1. Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * 3. All advertising materials mentioning features or use of this software
- * must display the following acknowledgement:
- * This product includes software developed by the University of
- * California, Berkeley and its contributors.
- * 4. Neither the name of the University nor the names of its contributors
- * may be used to endorse or promote products derived from this software
- * without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
- * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
- * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
- * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
- * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- *
- * @(#)COPYRIGHT 8.1 (Berkeley) 3/16/94
- */
+This regular expression package was originally developed by Henry Spencer.
+It bears the following copyright notice:
+
+**********************************************************************
+
+Copyright (c) 1998, 1999 Henry Spencer. All rights reserved.
+
+Development of this software was funded, in part, by Cray Research Inc.,
+UUNET Communications Services Inc., Sun Microsystems Inc., and Scriptics
+Corporation, none of whom are responsible for the results. The author
+thanks all of them.
+
+Redistribution and use in source and binary forms -- with or without
+modification -- are permitted for any purpose, provided that
+redistributions in source form retain this entire copyright notice and
+indicate the origin and nature of any modifications.
+
+I'd appreciate being given credit for this package in the documentation
+of software which uses it, but that is not a requirement.
+
+THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
+INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+HENRY SPENCER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
+OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
+OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
+ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+**********************************************************************
+
+PostgreSQL adopted the code out of Tcl 8.4.1. Portions of regc_locale.c
+and re_syntax.n were developed by Tcl developers other than Henry; these
+files bear the Tcl copyright and license notice:
+
+**********************************************************************
+
+This software is copyrighted by the Regents of the University of
+California, Sun Microsystems, Inc., Scriptics Corporation, ActiveState
+Corporation and other parties. The following terms apply to all files
+associated with the software unless explicitly disclaimed in
+individual files.
+
+The authors hereby grant permission to use, copy, modify, distribute,
+and license this software and its documentation for any purpose, provided
+that existing copyright notices are retained in all copies and that this
+notice is included verbatim in any distributions. No written agreement,
+license, or royalty fee is required for any of the authorized uses.
+Modifications to this software may be copyrighted by their authors
+and need not follow the licensing terms described here, provided that
+the new terms are clearly indicated on the first page of each file where
+they apply.
+
+IN NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY
+FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES
+ARISING OUT OF THE USE OF THIS SOFTWARE, ITS DOCUMENTATION, OR ANY
+DERIVATIVES THEREOF, EVEN IF THE AUTHORS HAVE BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGE.
+
+THE AUTHORS AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. THIS SOFTWARE
+IS PROVIDED ON AN "AS IS" BASIS, AND THE AUTHORS AND DISTRIBUTORS HAVE
+NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
+MODIFICATIONS.
+
+GOVERNMENT USE: If you are acquiring this software on behalf of the
+U.S. government, the Government shall have only "Restricted Rights"
+in the software and related documentation as defined in the Federal
+Acquisition Regulations (FARs) in Clause 52.227.19 (c) (2). If you
+are acquiring the software on behalf of the Department of Defense, the
+software shall be classified as "Commercial Computer Software" and the
+Government shall have only "Restricted Rights" as defined in Clause
+252.227-7013 (c) (1) of DFARs. Notwithstanding the foregoing, the
+authors grant the U.S. Government and others acting in its behalf
+permission to use and distribute the software in accordance with the
+terms specified in this license.
+
+**********************************************************************
+
+Subsequent modifications to the code by the PostgreSQL project follow
+the same license terms as the rest of PostgreSQL.
#-------------------------------------------------------------------------
#
# Makefile--
-# Makefile for regex
+# Makefile for backend/regex
#
# IDENTIFICATION
-# $Header: /cvsroot/pgsql/src/backend/regex/Makefile,v 1.19 2002/09/16 16:02:43 momjian Exp $
+# $Header: /cvsroot/pgsql/src/backend/regex/Makefile,v 1.20 2003/02/05 17:41:32 tgl Exp $
#
#-------------------------------------------------------------------------
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-override CPPFLAGS += -DPOSIX_MISTAKE
-
OBJS = regcomp.o regerror.o regexec.o regfree.o
-DEBUGOBJ += ../utils/mb/SUBSYS.o
all: SUBSYS.o
SUBSYS.o: $(OBJS)
$(LD) $(LDREL) $(LDOUT) SUBSYS.o $(OBJS)
-regexec.o: regexec.c engine.c
+# mark inclusion dependencies between .c files explicitly
+regcomp.o: regcomp.c regc_lex.c regc_color.c regc_nfa.c regc_cvec.c regc_locale.c
-# retest will not compile because multibyte is now enabled by default
-# and the multibyte calls require /mmgr, /adt, and other calls that
-# are complex for linkage, bjm 2002-09-16
-#retest: retest.o SUBSYS.o $(DEBUGOBJ)
-# $(CC) $(CFLAGS) $(LDFLAGS) $^ $(LIBS) -o $@
+regexec.o: regexec.c rege_dfa.c
clean:
- rm -f SUBSYS.o $(OBJS) retest retest.o
+ rm -f SUBSYS.o $(OBJS)
+++ /dev/null
-# @(#)WHATSNEW 8.3 (Berkeley) 3/18/94
-
-New in alpha3.4: The complex bug alluded to below has been fixed (in a
-slightly kludgey temporary way that may hurt efficiency a bit; this is
-another "get it out the door for 4.4" release). The tests at the end of
-the tests file have accordingly been uncommented. The primary sign of
-the bug was that something like a?b matching ab matched b rather than ab.
-(The bug was essentially specific to this exact situation, else it would
-have shown up earlier.)
-
-New in alpha3.3: The definition of word boundaries has been altered
-slightly, to more closely match the usual programming notion that "_"
-is an alphabetic. Stuff used for pre-ANSI systems is now in a subdir,
-and the makefile no longer alludes to it in mysterious ways. The
-makefile has generally been cleaned up some. Fixes have been made
-(again!) so that the regression test will run without -DREDEBUG, at
-the cost of weaker checking. A workaround for a bug in some folks'
-<assert.h> has been added. And some more things have been added to
-tests, including a couple right at the end which are commented out
-because the code currently flunks them (complex bug; fix coming).
-Plus the usual minor cleanup.
-
-New in alpha3.2: Assorted bits of cleanup and portability improvement
-(the development base is now a BSDI system using GCC instead of an ancient
-Sun system, and the newer compiler exposed some glitches). Fix for a
-serious bug that affected REs using many [] (including REG_ICASE REs
-because of the way they are implemented), *sometimes*, depending on
-memory-allocation patterns. The header-file prototypes no longer name
-the parameters, avoiding possible name conflicts. The possibility that
-some clot has defined CHAR_MIN as (say) `-128' instead of `(-128)' is
-now handled gracefully. "uchar" is no longer used as an internal type
-name (too many people have the same idea). Still the same old lousy
-performance, alas.
-
-New in alpha3.1: Basically nothing, this release is just a bookkeeping
-convenience. Stay tuned.
-
-New in alpha3.0: Performance is no better, alas, but some fixes have been
-made and some functionality has been added. (This is basically the "get
-it out the door in time for 4.4" release.) One bug fix: regfree() didn't
-free the main internal structure (how embarrassing). It is now possible
-to put NULs in either the RE or the target string, using (resp.) a new
-REG_PEND flag and the old REG_STARTEND flag. The REG_NOSPEC flag to
-regcomp() makes all characters ordinary, so you can match a literal
-string easily (this will become more useful when performance improves!).
-There are now primitives to match beginnings and ends of words, although
-the syntax is disgusting and so is the implementation. The REG_ATOI
-debugging interface has changed a bit. And there has been considerable
-internal cleanup of various kinds.
-
-New in alpha2.3: Split change list out of README, and moved flags notes
-into Makefile. Macro-ized the name of regex(7) in regex(3), since it has
-to change for 4.4BSD. Cleanup work in engine.c, and some new regression
-tests to catch tricky cases thereof.
-
-New in alpha2.2: Out-of-date manpages updated. Regerror() acquires two
-small extensions -- REG_ITOA and REG_ATOI -- which avoid debugging kludges
-in my own test program and might be useful to others for similar purposes.
-The regression test will now compile (and run) without REDEBUG. The
-BRE \$ bug is fixed. Most uses of "uchar" are gone; it's all chars now.
-Char/uchar parameters are now written int/unsigned, to avoid possible
-portability problems with unpromoted parameters. Some unsigned casts have
-been introduced to minimize portability problems with shifting into sign
-bits.
-
-New in alpha2.1: Lots of little stuff, cleanup and fixes. The one big
-thing is that regex.h is now generated, using mkh, rather than being
-supplied in the distribution; due to circularities in dependencies,
-you have to build regex.h explicitly by "make h". The two known bugs
-have been fixed (and the regression test now checks for them), as has a
-problem with assertions not being suppressed in the absence of REDEBUG.
-No performance work yet.
-
-New in alpha2: Backslash-anything is an ordinary character, not an
-error (except, of course, for the handful of backslashed metacharacters
-in BREs), which should reduce script breakage. The regression test
-checks *where* null strings are supposed to match, and has generally
-been tightened up somewhat. Small bug fixes in parameter passing (not
-harmful, but technically errors) and some other areas. Debugging
-invoked by defining REDEBUG rather than not defining NDEBUG.
-
-New in alpha+3: full prototyping for internal routines, using a little
-helper program, mkh, which extracts prototypes given in stylized comments.
-More minor cleanup. Buglet fix: it's CHAR_BIT, not CHAR_BITS. Simple
-pre-screening of input when a literal string is known to be part of the
-RE; this does wonders for performance.
-
-New in alpha+2: minor bits of cleanup. Notably, the number "32" for the
-word width isn't hardwired into regexec.c any more, the public header
-file prototypes the functions if __STDC__ is defined, and some small typos
-in the manpages have been fixed.
-
-New in alpha+1: improvements to the manual pages, and an important
-extension, the REG_STARTEND option to regexec().
+++ /dev/null
-/*-
- * Copyright (c) 1992, 1993, 1994 Henry Spencer.
- * Copyright (c) 1992, 1993, 1994
- * The Regents of the University of California. All rights reserved.
- *
- * This code is derived from software contributed to Berkeley by
- * Henry Spencer.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- * 1. Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * 3. All advertising materials mentioning features or use of this software
- * must display the following acknowledgement:
- * This product includes software developed by the University of
- * California, Berkeley and its contributors.
- * 4. Neither the name of the University nor the names of its contributors
- * may be used to endorse or promote products derived from this software
- * without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
- * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
- * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
- * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
- * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- *
- * @(#)engine.c 8.5 (Berkeley) 3/20/94
- */
-
-#include "postgres.h"
-
-/*
- * The matching engine and friends. This file is #included by regexec.c
- * after suitable #defines of a variety of macros used herein, so that
- * different state representations can be used without duplicating masses
- * of code.
- */
-
-#ifdef SNAMES
-#define matcher smatcher
-#define fast sfast
-#define slow sslow
-#define dissect sdissect
-#define backref sbackref
-#define step sstep
-#define print sprint
-#define at sat
-#define match smat
-#endif
-#ifdef LNAMES
-#define matcher lmatcher
-#define fast lfast
-#define slow lslow
-#define dissect ldissect
-#define backref lbackref
-#define step lstep
-#define print lprint
-#define at lat
-#define match lmat
-#endif
-
-/* another structure passed up and down to avoid zillions of parameters */
-struct match
-{
- struct re_guts *g;
- int eflags;
- regmatch_t *pmatch; /* [nsub+1] (0 element unused) */
- pg_wchar *offp; /* offsets work from here */
- pg_wchar *beginp; /* start of string -- virtual NUL precedes */
- pg_wchar *endp; /* end of string -- virtual NUL here */
- pg_wchar *coldp; /* can be no match starting before here */
- pg_wchar **lastpos; /* [nplus+1] */
- STATEVARS;
- states st; /* current states */
- states fresh; /* states for a fresh start */
- states tmp; /* temporary */
- states empty; /* empty set of states */
-};
-
-static int matcher(struct re_guts * g, pg_wchar *string, size_t nmatch,
- regmatch_t *pmatch, int eflags);
-static pg_wchar *dissect(struct match * m, pg_wchar *start, pg_wchar *stop,
- sopno startst, sopno stopst);
-static pg_wchar *backref(struct match * m, pg_wchar *start, pg_wchar *stop,
- sopno startst, sopno stopst, sopno lev);
-static pg_wchar *fast(struct match * m, pg_wchar *start, pg_wchar *stop,
- sopno startst, sopno stopst);
-static pg_wchar *slow(struct match * m, pg_wchar *start, pg_wchar *stop,
- sopno startst, sopno stopst);
-static states step(struct re_guts * g, sopno start,
- sopno stop, states bef, int ch, states aft);
-
-#define BOL (OUT+1)
-#define EOL (BOL+1)
-#define BOLEOL (BOL+2)
-#define NOTHING (BOL+3)
-#define BOW (BOL+4)
-#define EOW (BOL+5)
-#define CODEMAX (BOL+5) /* highest code used */
-
-#define NONCHAR(c) ((c) > 16777216) /* 16777216 == 2^24 == 3 bytes */
-#define NNONCHAR (CODEMAX-16777216)
-
-#ifdef REDEBUG
-static void print(struct match * m, pg_wchar *caption, states st, int ch,
- FILE *d);
-static void at(struct match * m, pg_wchar *title, pg_wchar *start,
- pg_wchar *stop, sopno startst, sopno stopst);
-static pg_wchar *pchar(int ch);
-static int pg_isprint(int c);
-#endif
-
-#ifdef REDEBUG
-#define SP(t, s, c) print(m, t, s, c, stdout)
-#define AT(t, p1, p2, s1, s2) at(m, t, p1, p2, s1, s2)
-#define NOTE(str) \
-do { \
- if (m->eflags®_TRACE) \
- printf("=%s\n", (str)); \
-} while (0)
-
-#else
-#define SP(t, s, c) /* nothing */
-#define AT(t, p1, p2, s1, s2) /* nothing */
-#define NOTE(s) /* nothing */
-#endif
-
-/*
- * matcher - the actual matching engine
- */
-static int /* 0 success, REG_NOMATCH failure */
-matcher(struct re_guts * g, pg_wchar *string, size_t nmatch,
- regmatch_t *pmatch, int eflags)
-{
- pg_wchar *endp;
- int i;
- struct match mv;
- struct match *m = &mv;
- pg_wchar *dp;
- const sopno gf = g->firststate + 1; /* +1 for OEND */
- const sopno gl = g->laststate;
- pg_wchar *start;
- pg_wchar *stop;
-
- /* simplify the situation where possible */
- if (g->cflags & REG_NOSUB)
- nmatch = 0;
- if (eflags & REG_STARTEND)
- {
- start = string + pmatch[0].rm_so;
- stop = string + pmatch[0].rm_eo;
- }
- else
- {
- start = string;
- stop = start + pg_wchar_strlen(start);
- }
- if (stop < start)
- return REG_INVARG;
-
- /* prescreening; this does wonders for this rather slow code */
- if (g->must != NULL)
- {
- for (dp = start; dp < stop; dp++)
- if (*dp == g->must[0] && stop - dp >= g->mlen &&
- memcmp(dp, g->must, (size_t) (g->mlen * sizeof(pg_wchar))) == 0
- )
- break;
- if (dp == stop) /* we didn't find g->must */
- return REG_NOMATCH;
- }
-
- /* match struct setup */
- m->g = g;
- m->eflags = eflags;
- m->pmatch = NULL;
- m->lastpos = NULL;
- m->offp = string;
- m->beginp = start;
- m->endp = stop;
- STATESETUP(m, 4);
- SETUP(m->st);
- SETUP(m->fresh);
- SETUP(m->tmp);
- SETUP(m->empty);
- CLEAR(m->empty);
-
- /* this loop does only one repetition except for backrefs */
- for (;;)
- {
- endp = fast(m, start, stop, gf, gl);
- if (endp == NULL)
- { /* a miss */
- STATETEARDOWN(m);
- return REG_NOMATCH;
- }
- if (nmatch == 0 && !g->backrefs)
- break; /* no further info needed */
-
- /* where? */
- assert(m->coldp != NULL);
- for (;;)
- {
- NOTE("finding start");
- endp = slow(m, m->coldp, stop, gf, gl);
- if (endp != NULL)
- break;
- assert(m->coldp < m->endp);
- m->coldp++;
- }
- if (nmatch == 1 && !g->backrefs)
- break; /* no further info needed */
-
- /* oh my, he wants the subexpressions... */
- if (m->pmatch == NULL)
- m->pmatch = (regmatch_t *) malloc((m->g->nsub + 1) *
- sizeof(regmatch_t));
- if (m->pmatch == NULL)
- {
- STATETEARDOWN(m);
- return REG_ESPACE;
- }
- for (i = 1; i <= m->g->nsub; i++)
- m->pmatch[i].rm_so = m->pmatch[i].rm_eo = -1;
- if (!g->backrefs && !(m->eflags & REG_BACKR))
- {
- NOTE("dissecting");
- dp = dissect(m, m->coldp, endp, gf, gl);
- }
- else
- {
- if (g->nplus > 0 && m->lastpos == NULL)
- m->lastpos = (pg_wchar **) malloc((g->nplus + 1) *
- sizeof(pg_wchar *));
- if (g->nplus > 0 && m->lastpos == NULL)
- {
- free(m->pmatch);
- STATETEARDOWN(m);
- return REG_ESPACE;
- }
- NOTE("backref dissect");
- dp = backref(m, m->coldp, endp, gf, gl, (sopno) 0);
- }
- if (dp != NULL)
- break;
-
- /* uh-oh... we couldn't find a subexpression-level match */
- assert(g->backrefs); /* must be back references doing it */
- assert(g->nplus == 0 || m->lastpos != NULL);
- for (;;)
- {
- if (dp != NULL || endp <= m->coldp)
- break; /* defeat */
- NOTE("backoff");
- endp = slow(m, m->coldp, endp - 1, gf, gl);
- if (endp == NULL)
- break; /* defeat */
- /* try it on a shorter possibility */
-#ifndef NDEBUG
- for (i = 1; i <= m->g->nsub; i++)
- {
- assert(m->pmatch[i].rm_so == -1);
- assert(m->pmatch[i].rm_eo == -1);
- }
-#endif
- NOTE("backoff dissect");
- dp = backref(m, m->coldp, endp, gf, gl, (sopno) 0);
- }
- assert(dp == NULL || dp == endp);
- if (dp != NULL) /* found a shorter one */
- break;
-
- /* despite initial appearances, there is no match here */
- NOTE("false alarm");
- start = m->coldp + 1; /* recycle starting later */
- assert(start <= stop);
- }
-
- /* fill in the details if requested */
- if (nmatch > 0)
- {
- pmatch[0].rm_so = m->coldp - m->offp;
- pmatch[0].rm_eo = endp - m->offp;
- }
- if (nmatch > 1)
- {
- assert(m->pmatch != NULL);
- for (i = 1; i < nmatch; i++)
- if (i <= m->g->nsub)
- pmatch[i] = m->pmatch[i];
- else
- {
- pmatch[i].rm_so = -1;
- pmatch[i].rm_eo = -1;
- }
- }
-
- if (m->pmatch != NULL)
- free((pg_wchar *) m->pmatch);
- if (m->lastpos != NULL)
- free((pg_wchar *) m->lastpos);
- STATETEARDOWN(m);
- return 0;
-}
-
-/*
- * dissect - figure out what matched what, no back references
- */
-static pg_wchar * /* == stop (success) always */
-dissect(struct match * m, pg_wchar *start, pg_wchar *stop,
- sopno startst, sopno stopst)
-{
- int i;
- sopno ss; /* start sop of current subRE */
- sopno es; /* end sop of current subRE */
- pg_wchar *sp; /* start of string matched by it */
- pg_wchar *stp; /* string matched by it cannot pass here */
- pg_wchar *rest; /* start of rest of string */
- pg_wchar *tail; /* string unmatched by rest of RE */
- sopno ssub; /* start sop of subsubRE */
- sopno esub; /* end sop of subsubRE */
- pg_wchar *ssp; /* start of string matched by subsubRE */
- pg_wchar *sep; /* end of string matched by subsubRE */
- pg_wchar *oldssp; /* previous ssp */
- pg_wchar *dp;
-
- AT("diss", start, stop, startst, stopst);
- sp = start;
- for (ss = startst; ss < stopst; ss = es)
- {
- /* identify end of subRE */
- es = ss;
- switch (OP(m->g->strip[es]))
- {
- case OPLUS_:
- case OQUEST_:
- es += OPND(m->g->strip[es]);
- break;
- case OCH_:
- while (OP(m->g->strip[es]) != O_CH)
- es += OPND(m->g->strip[es]);
- break;
- }
- es++;
-
- /* figure out what it matched */
- switch (OP(m->g->strip[ss]))
- {
- case OEND:
- assert(nope);
- break;
- case OCHAR:
- sp++;
- break;
- case OBOL:
- case OEOL:
- case OBOW:
- case OEOW:
- break;
- case OANY:
- case OANYOF:
- sp++;
- break;
- case OBACK_:
- case O_BACK:
- assert(nope);
- break;
- /* cases where length of match is hard to find */
- case OQUEST_:
- stp = stop;
- for (;;)
- {
- /* how long could this one be? */
- rest = slow(m, sp, stp, ss, es);
- assert(rest != NULL); /* it did match */
- /* could the rest match the rest? */
- tail = slow(m, rest, stop, es, stopst);
- if (tail == stop)
- break; /* yes! */
- /* no -- try a shorter match for this one */
- stp = rest - 1;
- assert(stp >= sp); /* it did work */
- }
- ssub = ss + 1;
- esub = es - 1;
- /* did innards match? */
- if (slow(m, sp, rest, ssub, esub) != NULL)
- {
- dp = dissect(m, sp, rest, ssub, esub);
- assert(dp == rest);
- }
- else
-/* no */
- assert(sp == rest);
- sp = rest;
- break;
- case OPLUS_:
- stp = stop;
- for (;;)
- {
- /* how long could this one be? */
- rest = slow(m, sp, stp, ss, es);
- assert(rest != NULL); /* it did match */
- /* could the rest match the rest? */
- tail = slow(m, rest, stop, es, stopst);
- if (tail == stop)
- break; /* yes! */
- /* no -- try a shorter match for this one */
- stp = rest - 1;
- assert(stp >= sp); /* it did work */
- }
- ssub = ss + 1;
- esub = es - 1;
- ssp = sp;
- oldssp = ssp;
- for (;;)
- { /* find last match of innards */
- sep = slow(m, ssp, rest, ssub, esub);
- if (sep == NULL || sep == ssp)
- break; /* failed or matched null */
- oldssp = ssp; /* on to next try */
- ssp = sep;
- }
- if (sep == NULL)
- {
- /* last successful match */
- sep = ssp;
- ssp = oldssp;
- }
- assert(sep == rest); /* must exhaust substring */
- assert(slow(m, ssp, sep, ssub, esub) == rest);
- dp = dissect(m, ssp, sep, ssub, esub);
- assert(dp == sep);
- sp = rest;
- break;
- case OCH_:
- stp = stop;
- for (;;)
- {
- /* how long could this one be? */
- rest = slow(m, sp, stp, ss, es);
- assert(rest != NULL); /* it did match */
- /* could the rest match the rest? */
- tail = slow(m, rest, stop, es, stopst);
- if (tail == stop)
- break; /* yes! */
- /* no -- try a shorter match for this one */
- stp = rest - 1;
- assert(stp >= sp); /* it did work */
- }
- ssub = ss + 1;
- esub = ss + OPND(m->g->strip[ss]) - 1;
- assert(OP(m->g->strip[esub]) == OOR1);
- for (;;)
- { /* find first matching branch */
- if (slow(m, sp, rest, ssub, esub) == rest)
- break; /* it matched all of it */
- /* that one missed, try next one */
- assert(OP(m->g->strip[esub]) == OOR1);
- esub++;
- assert(OP(m->g->strip[esub]) == OOR2);
- ssub = esub + 1;
- esub += OPND(m->g->strip[esub]);
- if (OP(m->g->strip[esub]) == OOR2)
- esub--;
- else
- assert(OP(m->g->strip[esub]) == O_CH);
- }
- dp = dissect(m, sp, rest, ssub, esub);
- assert(dp == rest);
- sp = rest;
- break;
- case O_PLUS:
- case O_QUEST:
- case OOR1:
- case OOR2:
- case O_CH:
- assert(nope);
- break;
- case OLPAREN:
- i = OPND(m->g->strip[ss]);
- assert(0 < i && i <= m->g->nsub);
- m->pmatch[i].rm_so = sp - m->offp;
- break;
- case ORPAREN:
- i = OPND(m->g->strip[ss]);
- assert(0 < i && i <= m->g->nsub);
- m->pmatch[i].rm_eo = sp - m->offp;
- break;
- default: /* uh oh */
- assert(nope);
- break;
- }
- }
-
- assert(sp == stop);
- return sp;
-}
-
-/*
- * backref - figure out what matched what, figuring in back references
- *
- * lev is PLUS nesting level
- */
-static pg_wchar * /* == stop (success) or NULL (failure) */
-backref(struct match * m, pg_wchar *start, pg_wchar *stop,
- sopno startst, sopno stopst, sopno lev)
-{
- int i;
- sopno ss; /* start sop of current subRE */
- pg_wchar *sp; /* start of string matched by it */
- sopno ssub; /* start sop of subsubRE */
- sopno esub; /* end sop of subsubRE */
- pg_wchar *ssp; /* start of string matched by subsubRE */
- pg_wchar *dp;
- size_t len;
- int hard;
- sop s;
- regoff_t offsave;
- cset *cs;
-
- AT("back", start, stop, startst, stopst);
- sp = start;
-
- /* get as far as we can with easy stuff */
- hard = 0;
- for (ss = startst; !hard && ss < stopst; ss++)
- switch (OP(s = m->g->strip[ss]))
- {
- case OCHAR:
- if (sp == stop || *sp++ != (pg_wchar) OPND(s))
- return NULL;
- break;
- case OANY:
- if (sp == stop)
- return NULL;
- sp++;
- break;
- case OANYOF:
- cs = &m->g->sets[OPND(s)];
- if (sp == stop || !CHIN(cs, *sp++))
- return NULL;
- break;
- case OBOL:
- if ((sp == m->beginp && !(m->eflags & REG_NOTBOL)) ||
- (sp < m->endp && *(sp - 1) == '\n' &&
- (m->g->cflags & REG_NEWLINE)))
- { /* yes */
- }
- else
- return NULL;
- break;
- case OEOL:
- if ((sp == m->endp && !(m->eflags & REG_NOTEOL)) ||
- (sp < m->endp && *sp == '\n' &&
- (m->g->cflags & REG_NEWLINE)))
- { /* yes */
- }
- else
- return NULL;
- break;
- case OBOW:
- if (((sp == m->beginp && !(m->eflags & REG_NOTBOL)) ||
- (sp < m->endp && *(sp - 1) == '\n' &&
- (m->g->cflags & REG_NEWLINE)) ||
- (sp > m->beginp &&
- !ISWORD(*(sp - 1)))) &&
- (sp < m->endp && ISWORD(*sp)))
- { /* yes */
- }
- else
- return NULL;
- break;
- case OEOW:
- if (((sp == m->endp && !(m->eflags & REG_NOTEOL)) ||
- (sp < m->endp && *sp == '\n' &&
- (m->g->cflags & REG_NEWLINE)) ||
- (sp < m->endp && !ISWORD(*sp))) &&
- (sp > m->beginp && ISWORD(*(sp - 1))))
- { /* yes */
- }
- else
- return NULL;
- break;
- case O_QUEST:
- break;
- case OOR1: /* matches null but needs to skip */
- ss++;
- s = m->g->strip[ss];
- do
- {
- assert(OP(s) == OOR2);
- ss += OPND(s);
- } while (OP(s = m->g->strip[ss]) != O_CH);
- /* note that the ss++ gets us past the O_CH */
- break;
- default: /* have to make a choice */
- hard = 1;
- break;
- }
- if (!hard)
- { /* that was it! */
- if (sp != stop)
- return NULL;
- return sp;
- }
- ss--; /* adjust for the for's final increment */
-
- /* the hard stuff */
- AT("hard", sp, stop, ss, stopst);
- s = m->g->strip[ss];
- switch (OP(s))
- {
- case OBACK_: /* the vilest depths */
- i = OPND(s);
- assert(0 < i && i <= m->g->nsub);
- if (m->pmatch[i].rm_eo == -1)
- return NULL;
- assert(m->pmatch[i].rm_so != -1);
- len = m->pmatch[i].rm_eo - m->pmatch[i].rm_so;
- assert(stop - m->beginp >= len);
- if (sp > stop - len)
- return NULL; /* not enough left to match */
- ssp = m->offp + m->pmatch[i].rm_so;
- if (memcmp(sp, ssp, len) != 0)
- return NULL;
- while (m->g->strip[ss] != SOP(O_BACK, i))
- ss++;
- return backref(m, sp + len, stop, ss + 1, stopst, lev);
- break;
- case OQUEST_: /* to null or not */
- dp = backref(m, sp, stop, ss + 1, stopst, lev);
- if (dp != NULL)
- return dp; /* not */
- return backref(m, sp, stop, ss + OPND(s) + 1, stopst, lev);
- break;
- case OPLUS_:
- assert(m->lastpos != NULL);
- assert(lev + 1 <= m->g->nplus);
- m->lastpos[lev + 1] = sp;
- return backref(m, sp, stop, ss + 1, stopst, lev + 1);
- break;
- case O_PLUS:
- if (sp == m->lastpos[lev]) /* last pass matched null */
- return backref(m, sp, stop, ss + 1, stopst, lev - 1);
- /* try another pass */
- m->lastpos[lev] = sp;
- dp = backref(m, sp, stop, ss - OPND(s) + 1, stopst, lev);
- if (dp == NULL)
- return backref(m, sp, stop, ss + 1, stopst, lev - 1);
- else
- return dp;
- break;
- case OCH_: /* find the right one, if any */
- ssub = ss + 1;
- esub = ss + OPND(s) - 1;
- assert(OP(m->g->strip[esub]) == OOR1);
- for (;;)
- { /* find first matching branch */
- dp = backref(m, sp, stop, ssub, esub, lev);
- if (dp != NULL)
- return dp;
- /* that one missed, try next one */
- if (OP(m->g->strip[esub]) == O_CH)
- return NULL; /* there is none */
- esub++;
- assert(OP(m->g->strip[esub]) == OOR2);
- ssub = esub + 1;
- esub += OPND(m->g->strip[esub]);
- if (OP(m->g->strip[esub]) == OOR2)
- esub--;
- else
- assert(OP(m->g->strip[esub]) == O_CH);
- }
- break;
- case OLPAREN: /* must undo assignment if rest fails */
- i = OPND(s);
- assert(0 < i && i <= m->g->nsub);
- offsave = m->pmatch[i].rm_so;
- m->pmatch[i].rm_so = sp - m->offp;
- dp = backref(m, sp, stop, ss + 1, stopst, lev);
- if (dp != NULL)
- return dp;
- m->pmatch[i].rm_so = offsave;
- return NULL;
- break;
- case ORPAREN: /* must undo assignment if rest fails */
- i = OPND(s);
- assert(0 < i && i <= m->g->nsub);
- offsave = m->pmatch[i].rm_eo;
- m->pmatch[i].rm_eo = sp - m->offp;
- dp = backref(m, sp, stop, ss + 1, stopst, lev);
- if (dp != NULL)
- return dp;
- m->pmatch[i].rm_eo = offsave;
- return NULL;
- break;
- default: /* uh oh */
- assert(nope);
- break;
- }
-
- /* "can't happen" */
- assert(nope);
- /* NOTREACHED */
- return 0;
-}
-
-/*
- * fast - step through the string at top speed
- */
-static pg_wchar * /* where tentative match ended, or NULL */
-fast(struct match * m, pg_wchar *start, pg_wchar *stop,
- sopno startst, sopno stopst)
-{
- states st = m->st;
- states fresh = m->fresh;
- states tmp = m->tmp;
- pg_wchar *p = start;
- int c = (start == m->beginp) ? OUT : *(start - 1);
- int lastc; /* previous c */
- int flagch;
- int i;
- pg_wchar *coldp; /* last p after which no match was
- * underway */
-
- CLEAR(st);
- SET1(st, startst);
- st = step(m->g, startst, stopst, st, NOTHING, st);
- ASSIGN(fresh, st);
- SP("start", st, *p);
- coldp = NULL;
- for (;;)
- {
- /* next character */
- lastc = c;
- c = (p == m->endp) ? OUT : *p;
- if (EQ(st, fresh))
- coldp = p;
-
- /* is there an EOL and/or BOL between lastc and c? */
- flagch = '\0';
- i = 0;
- if ((lastc == '\n' && m->g->cflags & REG_NEWLINE) ||
- (lastc == OUT && !(m->eflags & REG_NOTBOL)))
- {
- flagch = BOL;
- i = m->g->nbol;
- }
- if ((c == '\n' && m->g->cflags & REG_NEWLINE) ||
- (c == OUT && !(m->eflags & REG_NOTEOL)))
- {
- flagch = (flagch == BOL) ? BOLEOL : EOL;
- i += m->g->neol;
- }
- if (i != 0)
- {
- for (; i > 0; i--)
- st = step(m->g, startst, stopst, st, flagch, st);
- SP("boleol", st, c);
- }
-
- /* how about a word boundary? */
- if ((flagch == BOL || (lastc != OUT && !ISWORD(lastc))) &&
- (c != OUT && ISWORD(c)))
- flagch = BOW;
- if ((lastc != OUT && ISWORD(lastc)) &&
- (flagch == EOL || (c != OUT && !ISWORD(c))))
- flagch = EOW;
- if (flagch == BOW || flagch == EOW)
- {
- st = step(m->g, startst, stopst, st, flagch, st);
- SP("boweow", st, c);
- }
-
- /* are we done? */
- if (ISSET(st, stopst) || p == stop)
- break; /* NOTE BREAK OUT */
-
- /* no, we must deal with this character */
- ASSIGN(tmp, st);
- ASSIGN(st, fresh);
- assert(c != OUT);
- st = step(m->g, startst, stopst, tmp, c, st);
- SP("aft", st, c);
- assert(EQ(step(m->g, startst, stopst, st, NOTHING, st), st));
- p++;
- }
-
- assert(coldp != NULL);
- m->coldp = coldp;
- if (ISSET(st, stopst))
- return p + 1;
- else
- return NULL;
-}
-
-/*
- * slow - step through the string more deliberately
- */
-static pg_wchar * /* where it ended */
-slow(struct match * m, pg_wchar *start, pg_wchar *stop,
- sopno startst, sopno stopst)
-{
- states st = m->st;
- states empty = m->empty;
- states tmp = m->tmp;
- pg_wchar *p = start;
- int c = (start == m->beginp) ? OUT : *(start - 1);
- int lastc; /* previous c */
- int flagch;
- int i;
- pg_wchar *matchp; /* last p at which a match ended */
-
- AT("slow", start, stop, startst, stopst);
- CLEAR(st);
- SET1(st, startst);
- SP("sstart", st, *p);
- st = step(m->g, startst, stopst, st, NOTHING, st);
- matchp = NULL;
- for (;;)
- {
- /* next character */
- lastc = c;
- c = (p == m->endp) ? OUT : *p;
-
- /* is there an EOL and/or BOL between lastc and c? */
- flagch = '\0';
- i = 0;
- if ((lastc == '\n' && m->g->cflags & REG_NEWLINE) ||
- (lastc == OUT && !(m->eflags & REG_NOTBOL)))
- {
- flagch = BOL;
- i = m->g->nbol;
- }
- if ((c == '\n' && m->g->cflags & REG_NEWLINE) ||
- (c == OUT && !(m->eflags & REG_NOTEOL)))
- {
- flagch = (flagch == BOL) ? BOLEOL : EOL;
- i += m->g->neol;
- }
- if (i != 0)
- {
- for (; i > 0; i--)
- st = step(m->g, startst, stopst, st, flagch, st);
- SP("sboleol", st, c);
- }
-
- /* how about a word boundary? */
- if ((flagch == BOL || (lastc != OUT && !ISWORD(lastc))) &&
- (c != OUT && ISWORD(c)))
- flagch = BOW;
- if ((lastc != OUT && ISWORD(lastc)) &&
- (flagch == EOL || (c != OUT && !ISWORD(c))))
- flagch = EOW;
- if (flagch == BOW || flagch == EOW)
- {
- st = step(m->g, startst, stopst, st, flagch, st);
- SP("sboweow", st, c);
- }
-
- /* are we done? */
- if (ISSET(st, stopst))
- matchp = p;
- if (EQ(st, empty) || p == stop)
- break; /* NOTE BREAK OUT */
-
- /* no, we must deal with this character */
- ASSIGN(tmp, st);
- ASSIGN(st, empty);
- assert(c != OUT);
- st = step(m->g, startst, stopst, tmp, c, st);
- SP("saft", st, c);
- assert(EQ(step(m->g, startst, stopst, st, NOTHING, st), st));
- p++;
- }
-
- return matchp;
-}
-
-
-/*
- * step - map set of states reachable before char to set reachable after
- */
-static states
-step(struct re_guts * g,
- sopno start, /* start state within strip */
- sopno stop, /* state after stop state within strip */
- states bef, /* states reachable before */
- int ch, /* character or NONCHAR code */
- states aft) /* states already known reachable after */
-{
- cset *cs;
- sop s;
- sopno pc;
- onestate here; /* note, macros know this name */
- sopno look;
- int i;
-
- for (pc = start, INIT(here, pc); pc != stop; pc++, INC(here))
- {
- s = g->strip[pc];
- switch (OP(s))
- {
- case OEND:
- assert(pc == stop - 1);
- break;
- case OCHAR:
- /* only characters can match */
- assert(!NONCHAR(ch) || ch != (pg_wchar) OPND(s));
- if (ch == (pg_wchar) OPND(s))
- FWD(aft, bef, 1);
- break;
- case OBOL:
- if (ch == BOL || ch == BOLEOL)
- FWD(aft, bef, 1);
- break;
- case OEOL:
- if (ch == EOL || ch == BOLEOL)
- FWD(aft, bef, 1);
- break;
- case OBOW:
- if (ch == BOW)
- FWD(aft, bef, 1);
- break;
- case OEOW:
- if (ch == EOW)
- FWD(aft, bef, 1);
- break;
- case OANY:
- if (!NONCHAR(ch))
- FWD(aft, bef, 1);
- break;
- case OANYOF:
- cs = &g->sets[OPND(s)];
- if (!NONCHAR(ch) && CHIN(cs, ch))
- FWD(aft, bef, 1);
- break;
- case OBACK_: /* ignored here */
- case O_BACK:
- FWD(aft, aft, 1);
- break;
- case OPLUS_: /* forward, this is just an empty */
- FWD(aft, aft, 1);
- break;
- case O_PLUS: /* both forward and back */
- FWD(aft, aft, 1);
- i = ISSETBACK(aft, OPND(s));
- BACK(aft, aft, OPND(s));
- if (!i && ISSETBACK(aft, OPND(s)))
- {
- /* oho, must reconsider loop body */
- pc -= OPND(s) + 1;
- INIT(here, pc);
- }
- break;
- case OQUEST_: /* two branches, both forward */
- FWD(aft, aft, 1);
- FWD(aft, aft, OPND(s));
- break;
- case O_QUEST: /* just an empty */
- FWD(aft, aft, 1);
- break;
- case OLPAREN: /* not significant here */
- case ORPAREN:
- FWD(aft, aft, 1);
- break;
- case OCH_: /* mark the first two branches */
- FWD(aft, aft, 1);
- assert(OP(g->strip[pc + OPND(s)]) == OOR2);
- FWD(aft, aft, OPND(s));
- break;
- case OOR1: /* done a branch, find the O_CH */
- if (ISSTATEIN(aft, here))
- {
- for (look = 1;
- OP(s = g->strip[pc + look]) != O_CH;
- look += OPND(s))
- assert(OP(s) == OOR2);
- FWD(aft, aft, look);
- }
- break;
- case OOR2: /* propagate OCH_'s marking */
- FWD(aft, aft, 1);
- if (OP(g->strip[pc + OPND(s)]) != O_CH)
- {
- assert(OP(g->strip[pc + OPND(s)]) == OOR2);
- FWD(aft, aft, OPND(s));
- }
- break;
- case O_CH: /* just empty */
- FWD(aft, aft, 1);
- break;
- default: /* ooooops... */
- assert(nope);
- break;
- }
- }
-
- return aft;
-}
-
-#ifdef REDEBUG
-/*
- * print - print a set of states
- */
-static void
-print(struct match * m, pg_wchar *caption, states st,
- int ch, FILE *d)
-{
- struct re_guts *g = m->g;
- int i;
- int first = 1;
-
- if (!(m->eflags & REG_TRACE))
- return;
-
- fprintf(d, "%s", caption);
- if (ch != '\0')
- fprintf(d, " %s", pchar(ch));
- for (i = 0; i < g->nstates; i++)
- if (ISSET(st, i))
- {
- fprintf(d, "%s%d", (first) ? "\t" : ", ", i);
- first = 0;
- }
- fprintf(d, "\n");
-}
-
-/*
- * at - print current situation
- */
-static void
-at(struct match * m, pg_wchar *title, pg_wchar *start, pg_wchar *stop,
- sopno startst, sopno stopst)
-{
- if (!(m->eflags & REG_TRACE))
- return;
-
- printf("%s %s-", title, pchar(*start));
- printf("%s ", pchar(*stop));
- printf("%ld-%ld\n", (long) startst, (long) stopst);
-}
-
-#ifndef PCHARDONE
-#define PCHARDONE /* only do this once */
-/*
- * pchar - make a character printable
- *
- * Is this identical to regchar() over in debug.c? Well, yes. But a
- * duplicate here avoids having a debugging-capable regexec.o tied to
- * a matching debug.o, and this is convenient. It all disappears in
- * the non-debug compilation anyway, so it doesn't matter much.
- */
-static pg_wchar * /* -> representation */
-pchar(int ch)
-{
- static pg_wchar pbuf[10];
-
- if (pg_isprint(ch) || ch == ' ')
- sprintf(pbuf, "%c", ch);
- else
- sprintf(pbuf, "\\%o", ch);
- return pbuf;
-}
-
-static int
-pg_isprint(int c)
-{
- return (c >= 0 && c <= UCHAR_MAX && isprint((unsigned char) c));
-}
-#endif
-#endif
-
-#undef matcher
-#undef fast
-#undef slow
-#undef dissect
-#undef backref
-#undef step
-#undef print
-#undef at
-#undef match
+++ /dev/null
-.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
-.\" Copyright (c) 1992, 1993, 1994
-.\" The Regents of the University of California. All rights reserved.
-.\"
-.\" This code is derived from software contributed to Berkeley by
-.\" Henry Spencer.
-.\"
-.\" Redistribution and use in source and binary forms, with or without
-.\" modification, are permitted provided that the following conditions
-.\" are met:
-.\" 1. Redistributions of source code must retain the above copyright
-.\" notice, this list of conditions and the following disclaimer.
-.\" 2. Redistributions in binary form must reproduce the above copyright
-.\" notice, this list of conditions and the following disclaimer in the
-.\" documentation and/or other materials provided with the distribution.
-.\" 3. All advertising materials mentioning features or use of this software
-.\" must display the following acknowledgement:
-.\" This product includes software developed by the University of
-.\" California, Berkeley and its contributors.
-.\" 4. Neither the name of the University nor the names of its contributors
-.\" may be used to endorse or promote products derived from this software
-.\" without specific prior written permission.
-.\"
-.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
-.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
-.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
-.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
-.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
-.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
-.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
-.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
-.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
-.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
-.\" SUCH DAMAGE.
-.\"
-.\" @(#)re_format.7 8.3 (Berkeley) 3/20/94
-.\"
-.TH RE_FORMAT 7 "March 20, 1994"
-.SH NAME
-re_format \- POSIX 1003.2 regular expressions
-.SH DESCRIPTION
-Regular expressions (``RE''s),
-as defined in POSIX 1003.2, come in two forms:
-modern REs (roughly those of
-.IR egrep ;
-1003.2 calls these ``extended'' REs)
-and obsolete REs (roughly those of
-.IR ed ;
-1003.2 ``basic'' REs).
-Obsolete REs mostly exist for backward compatibility in some old programs;
-they will be discussed at the end.
-1003.2 leaves some aspects of RE syntax and semantics open;
-`\(dg' marks decisions on these aspects that
-may not be fully portable to other 1003.2 implementations.
-.PP
-A (modern) RE is one\(dg or more non-empty\(dg \fIbranches\fR,
-separated by `|'.
-It matches anything that matches one of the branches.
-.PP
-A branch is one\(dg or more \fIpieces\fR, concatenated.
-It matches a match for the first, followed by a match for the second, etc.
-.PP
-A piece is an \fIatom\fR possibly followed
-by a single\(dg `*', `+', `?', or \fIbound\fR.
-An atom followed by `*' matches a sequence of 0 or more matches of the atom.
-An atom followed by `+' matches a sequence of 1 or more matches of the atom.
-An atom followed by `?' matches a sequence of 0 or 1 matches of the atom.
-.PP
-A \fIbound\fR is `{' followed by an unsigned decimal integer,
-possibly followed by `,'
-possibly followed by another unsigned decimal integer,
-always followed by `}'.
-The integers must lie between 0 and RE_DUP_MAX (255\(dg) inclusive,
-and if there are two of them, the first may not exceed the second.
-An atom followed by a bound containing one integer \fIi\fR
-and no comma matches
-a sequence of exactly \fIi\fR matches of the atom.
-An atom followed by a bound
-containing one integer \fIi\fR and a comma matches
-a sequence of \fIi\fR or more matches of the atom.
-An atom followed by a bound
-containing two integers \fIi\fR and \fIj\fR matches
-a sequence of \fIi\fR through \fIj\fR (inclusive) matches of the atom.
-.PP
-An atom is a regular expression enclosed in `()' (matching a match for the
-regular expression),
-an empty set of `()' (matching the null string)\(dg,
-a \fIbracket expression\fR (see below), `.'
-(matching any single character), `^' (matching the null string at the
-beginning of a line), `$' (matching the null string at the
-end of a line), a `\e' followed by one of the characters
-`^.[$()|*+?{\e'
-(matching that character taken as an ordinary character),
-a `\e' followed by any other character\(dg
-(matching that character taken as an ordinary character,
-as if the `\e' had not been present\(dg),
-or a single character with no other significance (matching that character).
-A `{' followed by a character other than a digit is an ordinary
-character, not the beginning of a bound\(dg.
-It is illegal to end an RE with `\e'.
-.PP
-A \fIbracket expression\fR is a list of characters enclosed in `[]'.
-It normally matches any single character from the list (but see below).
-If the list begins with `^',
-it matches any single character
-(but see below) \fInot\fR from the rest of the list.
-If two characters in the list are separated by `\-', this is shorthand
-for the full \fIrange\fR of characters between those two (inclusive) in the
-collating sequence,
-e.g. `[0-9]' in ASCII matches any decimal digit.
-It is illegal\(dg for two ranges to share an
-endpoint, e.g. `a-c-e'.
-Ranges are very collating-sequence-dependent,
-and portable programs should avoid relying on them.
-.PP
-To include a literal `]' in the list, make it the first character
-(following a possible `^').
-To include a literal `\-', make it the first or last character,
-or the second endpoint of a range.
-To use a literal `\-' as the first endpoint of a range,
-enclose it in `[.' and `.]' to make it a collating element (see below).
-With the exception of these and some combinations using `[' (see next
-paragraphs), all other special characters, including `\e', lose their
-special significance within a bracket expression.
-.PP
-Within a bracket expression, a collating element (a character,
-a multi-character sequence that collates as if it were a single character,
-or a collating-sequence name for either)
-enclosed in `[.' and `.]' stands for the
-sequence of characters of that collating element.
-The sequence is a single element of the bracket expression's list.
-A bracket expression containing a multi-character collating element
-can thus match more than one character,
-e.g. if the collating sequence includes a `ch' collating element,
-then the RE `[[.ch.]]*c' matches the first five characters
-of `chchcc'.
-.PP
-Within a bracket expression, a collating element enclosed in `[=' and
-`=]' is an equivalence class, standing for the sequences of characters
-of all collating elements equivalent to that one, including itself.
-(If there are no other equivalent collating elements,
-the treatment is as if the enclosing delimiters were `[.' and `.]'.)
-For example, if o and \o'o^' are the members of an equivalence class,
-then `[[=o=]]', `[[=\o'o^'=]]', and `[o\o'o^']' are all synonymous.
-An equivalence class may not\(dg be an endpoint
-of a range.
-.PP
-Within a bracket expression, the name of a \fIcharacter class\fR enclosed
-in `[:' and `:]' stands for the list of all characters belonging to that
-class.
-Standard character class names are:
-.PP
-.RS
-.nf
-.ta 3c 6c 9c
-alnum digit punct
-alpha graph space
-blank lower upper
-cntrl print xdigit
-.fi
-.RE
-.PP
-These stand for the character classes defined in
-.IR ctype (3).
-A locale may provide others.
-A character class may not be used as an endpoint of a range.
-.PP
-There are two special cases\(dg of bracket expressions:
-the bracket expressions `[[:<:]]' and `[[:>:]]' match the null string at
-the beginning and end of a word respectively.
-A word is defined as a sequence of
-word characters
-which is neither preceded nor followed by
-word characters.
-A word character is an
-.I alnum
-character (as defined by
-.IR ctype (3))
-or an underscore.
-This is an extension,
-compatible with but not specified by POSIX 1003.2,
-and should be used with
-caution in software intended to be portable to other systems.
-.PP
-In the event that an RE could match more than one substring of a given
-string,
-the RE matches the one starting earliest in the string.
-If the RE could match more than one substring starting at that point,
-it matches the longest.
-Subexpressions also match the longest possible substrings, subject to
-the constraint that the whole match be as long as possible,
-with subexpressions starting earlier in the RE taking priority over
-ones starting later.
-Note that higher-level subexpressions thus take priority over
-their lower-level component subexpressions.
-.PP
-Match lengths are measured in characters, not collating elements.
-A null string is considered longer than no match at all.
-For example,
-`bb*' matches the three middle characters of `abbbc',
-`(wee|week)(knights|nights)' matches all ten characters of `weeknights',
-when `(.*).*' is matched against `abc' the parenthesized subexpression
-matches all three characters, and
-when `(a*)*' is matched against `bc' both the whole RE and the parenthesized
-subexpression match the null string.
-.PP
-If case-independent matching is specified,
-the effect is much as if all case distinctions had vanished from the
-alphabet.
-When an alphabetic that exists in multiple cases appears as an
-ordinary character outside a bracket expression, it is effectively
-transformed into a bracket expression containing both cases,
-e.g. `x' becomes `[xX]'.
-When it appears inside a bracket expression, all case counterparts
-of it are added to the bracket expression, so that (e.g.) `[x]'
-becomes `[xX]' and `[^x]' becomes `[^xX]'.
-.PP
-No particular limit is imposed on the length of REs\(dg.
-Programs intended to be portable should not employ REs longer
-than 256 bytes,
-as an implementation can refuse to accept such REs and remain
-POSIX-compliant.
-.PP
-Obsolete (``basic'') regular expressions differ in several respects.
-`|', `+', and `?' are ordinary characters and there is no equivalent
-for their functionality.
-The delimiters for bounds are `\e{' and `\e}',
-with `{' and `}' by themselves ordinary characters.
-The parentheses for nested subexpressions are `\e(' and `\e)',
-with `(' and `)' by themselves ordinary characters.
-`^' is an ordinary character except at the beginning of the
-RE or\(dg the beginning of a parenthesized subexpression,
-`$' is an ordinary character except at the end of the
-RE or\(dg the end of a parenthesized subexpression,
-and `*' is an ordinary character if it appears at the beginning of the
-RE or the beginning of a parenthesized subexpression
-(after a possible leading `^').
-Finally, there is one new type of atom, a \fIback reference\fR:
-`\e' followed by a non-zero decimal digit \fId\fR
-matches the same sequence of characters
-matched by the \fId\fRth parenthesized subexpression
-(numbering subexpressions by the positions of their opening parentheses,
-left to right),
-so that (e.g.) `\e([bc]\e)\e1' matches `bb' or `cc' but not `bc'.
-.SH SEE ALSO
-regex(3)
-.PP
-POSIX 1003.2, section 2.8 (Regular Expression Notation).
-.SH BUGS
-Having two kinds of REs is a botch.
-.PP
-The current 1003.2 spec says that `)' is an ordinary character in
-the absence of an unmatched `(';
-this was an unintentional result of a wording error,
-and change is likely.
-Avoid relying on it.
-.PP
-Back references are a dreadful botch,
-posing major problems for efficient implementations.
-They are also somewhat vaguely defined
-(does
-`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?).
-Avoid using them.
-.PP
-1003.2's specification of case-independent matching is vague.
-The ``one case implies all cases'' definition given above
-is current consensus among implementors as to the right interpretation.
-.PP
-The syntax for word boundaries is incredibly ugly.
--- /dev/null
+'\"
+'\" Copyright (c) 1998 Sun Microsystems, Inc.
+'\" Copyright (c) 1999 Scriptics Corporation
+'\"
+'\" This software is copyrighted by the Regents of the University of
+'\" California, Sun Microsystems, Inc., Scriptics Corporation, ActiveState
+'\" Corporation and other parties. The following terms apply to all files
+'\" associated with the software unless explicitly disclaimed in
+'\" individual files.
+'\"
+'\" The authors hereby grant permission to use, copy, modify, distribute,
+'\" and license this software and its documentation for any purpose, provided
+'\" that existing copyright notices are retained in all copies and that this
+'\" notice is included verbatim in any distributions. No written agreement,
+'\" license, or royalty fee is required for any of the authorized uses.
+'\" Modifications to this software may be copyrighted by their authors
+'\" and need not follow the licensing terms described here, provided that
+'\" the new terms are clearly indicated on the first page of each file where
+'\" they apply.
+'\"
+'\" IN NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY
+'\" FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES
+'\" ARISING OUT OF THE USE OF THIS SOFTWARE, ITS DOCUMENTATION, OR ANY
+'\" DERIVATIVES THEREOF, EVEN IF THE AUTHORS HAVE BEEN ADVISED OF THE
+'\" POSSIBILITY OF SUCH DAMAGE.
+'\"
+'\" THE AUTHORS AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+'\" INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY,
+'\" FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. THIS SOFTWARE
+'\" IS PROVIDED ON AN "AS IS" BASIS, AND THE AUTHORS AND DISTRIBUTORS HAVE
+'\" NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
+'\" MODIFICATIONS.
+'\"
+'\" GOVERNMENT USE: If you are acquiring this software on behalf of the
+'\" U.S. government, the Government shall have only "Restricted Rights"
+'\" in the software and related documentation as defined in the Federal
+'\" Acquisition Regulations (FARs) in Clause 52.227.19 (c) (2). If you
+'\" are acquiring the software on behalf of the Department of Defense, the
+'\" software shall be classified as "Commercial Computer Software" and the
+'\" Government shall have only "Restricted Rights" as defined in Clause
+'\" 252.227-7013 (c) (1) of DFARs. Notwithstanding the foregoing, the
+'\" authors grant the U.S. Government and others acting in its behalf
+'\" permission to use and distribute the software in accordance with the
+'\" terms specified in this license.
+'\"
+'\" RCS: @(#) Id: re_syntax.n,v 1.3 1999/07/14 19:09:36 jpeek Exp
+'\"
+.so man.macros
+.TH re_syntax n "8.1" Tcl "Tcl Built-In Commands"
+.BS
+.SH NAME
+re_syntax \- Syntax of Tcl regular expressions.
+.BE
+
+.SH DESCRIPTION
+.PP
+A \fIregular expression\fR describes strings of characters.
+It's a pattern that matches certain strings and doesn't match others.
+
+.SH "DIFFERENT FLAVORS OF REs"
+Regular expressions (``RE''s), as defined by POSIX, come in two
+flavors: \fIextended\fR REs (``EREs'') and \fIbasic\fR REs (``BREs'').
+EREs are roughly those of the traditional \fIegrep\fR, while BREs are
+roughly those of the traditional \fIed\fR. This implementation adds
+a third flavor, \fIadvanced\fR REs (``AREs''), basically EREs with
+some significant extensions.
+.PP
+This manual page primarily describes AREs. BREs mostly exist for
+backward compatibility in some old programs; they will be discussed at
+the end. POSIX EREs are almost an exact subset of AREs. Features of
+AREs that are not present in EREs will be indicated.
+
+.SH "REGULAR EXPRESSION SYNTAX"
+.PP
+Tcl regular expressions are implemented using the package written by
+Henry Spencer, based on the 1003.2 spec and some (not quite all) of
+the Perl5 extensions (thanks, Henry!). Much of the description of
+regular expressions below is copied verbatim from his manual entry.
+.PP
+An ARE is one or more \fIbranches\fR,
+separated by `\fB|\fR',
+matching anything that matches any of the branches.
+.PP
+A branch is zero or more \fIconstraints\fR or \fIquantified atoms\fR,
+concatenated.
+It matches a match for the first, followed by a match for the second, etc;
+an empty branch matches the empty string.
+.PP
+A quantified atom is an \fIatom\fR possibly followed
+by a single \fIquantifier\fR.
+Without a quantifier, it matches a match for the atom.
+The quantifiers,
+and what a so-quantified atom matches, are:
+.RS 2
+.TP 6
+\fB*\fR
+a sequence of 0 or more matches of the atom
+.TP
+\fB+\fR
+a sequence of 1 or more matches of the atom
+.TP
+\fB?\fR
+a sequence of 0 or 1 matches of the atom
+.TP
+\fB{\fIm\fB}\fR
+a sequence of exactly \fIm\fR matches of the atom
+.TP
+\fB{\fIm\fB,}\fR
+a sequence of \fIm\fR or more matches of the atom
+.TP
+\fB{\fIm\fB,\fIn\fB}\fR
+a sequence of \fIm\fR through \fIn\fR (inclusive) matches of the atom;
+\fIm\fR may not exceed \fIn\fR
+.TP
+\fB*? +? ?? {\fIm\fB}? {\fIm\fB,}? {\fIm\fB,\fIn\fB}?\fR
+\fInon-greedy\fR quantifiers,
+which match the same possibilities,
+but prefer the smallest number rather than the largest number
+of matches (see MATCHING)
+.RE
+.PP
+The forms using
+\fB{\fR and \fB}\fR
+are known as \fIbound\fRs.
+The numbers
+\fIm\fR and \fIn\fR are unsigned decimal integers
+with permissible values from 0 to 255 inclusive.
+.PP
+An atom is one of:
+.RS 2
+.TP 6
+\fB(\fIre\fB)\fR
+(where \fIre\fR is any regular expression)
+matches a match for
+\fIre\fR, with the match noted for possible reporting
+.TP
+\fB(?:\fIre\fB)\fR
+as previous,
+but does no reporting
+(a ``non-capturing'' set of parentheses)
+.TP
+\fB()\fR
+matches an empty string,
+noted for possible reporting
+.TP
+\fB(?:)\fR
+matches an empty string,
+without reporting
+.TP
+\fB[\fIchars\fB]\fR
+a \fIbracket expression\fR,
+matching any one of the \fIchars\fR (see BRACKET EXPRESSIONS for more detail)
+.TP
+ \fB.\fR
+matches any single character
+.TP
+\fB\e\fIk\fR
+(where \fIk\fR is a non-alphanumeric character)
+matches that character taken as an ordinary character,
+e.g. \e\e matches a backslash character
+.TP
+\fB\e\fIc\fR
+where \fIc\fR is alphanumeric
+(possibly followed by other characters),
+an \fIescape\fR (AREs only),
+see ESCAPES below
+.TP
+\fB{\fR
+when followed by a character other than a digit,
+matches the left-brace character `\fB{\fR';
+when followed by a digit, it is the beginning of a
+\fIbound\fR (see above)
+.TP
+\fIx\fR
+where \fIx\fR is
+a single character with no other significance, matches that character.
+.RE
+.PP
+A \fIconstraint\fR matches an empty string when specific conditions
+are met.
+A constraint may not be followed by a quantifier.
+The simple constraints are as follows; some more constraints are
+described later, under ESCAPES.
+.RS 2
+.TP 8
+\fB^\fR
+matches at the beginning of a line
+.TP
+\fB$\fR
+matches at the end of a line
+.TP
+\fB(?=\fIre\fB)\fR
+\fIpositive lookahead\fR (AREs only), matches at any point
+where a substring matching \fIre\fR begins
+.TP
+\fB(?!\fIre\fB)\fR
+\fInegative lookahead\fR (AREs only), matches at any point
+where no substring matching \fIre\fR begins
+.RE
+.PP
+The lookahead constraints may not contain back references (see later),
+and all parentheses within them are considered non-capturing.
+.PP
+An RE may not end with `\fB\e\fR'.
+
+.SH "BRACKET EXPRESSIONS"
+A \fIbracket expression\fR is a list of characters enclosed in `\fB[\|]\fR'.
+It normally matches any single character from the list (but see below).
+If the list begins with `\fB^\fR',
+it matches any single character
+(but see below) \fInot\fR from the rest of the list.
+.PP
+If two characters in the list are separated by `\fB\-\fR',
+this is shorthand
+for the full \fIrange\fR of characters between those two (inclusive) in the
+collating sequence,
+e.g.
+\fB[0\-9]\fR
+in ASCII matches any decimal digit.
+Two ranges may not share an
+endpoint, so e.g.
+\fBa\-c\-e\fR
+is illegal.
+Ranges are very collating-sequence-dependent,
+and portable programs should avoid relying on them.
+.PP
+To include a literal
+\fB]\fR
+or
+\fB\-\fR
+in the list,
+the simplest method is to
+enclose it in
+\fB[.\fR and \fB.]\fR
+to make it a collating element (see below).
+Alternatively,
+make it the first character
+(following a possible `\fB^\fR'),
+or (AREs only) precede it with `\fB\e\fR'.
+Alternatively, for `\fB\-\fR',
+make it the last character,
+or the second endpoint of a range.
+To use a literal
+\fB\-\fR
+as the first endpoint of a range,
+make it a collating element
+or (AREs only) precede it with `\fB\e\fR'.
+With the exception of these, some combinations using
+\fB[\fR
+(see next
+paragraphs), and escapes,
+all other special characters lose their
+special significance within a bracket expression.
+.PP
+Within a bracket expression, a collating element (a character,
+a multi-character sequence that collates as if it were a single character,
+or a collating-sequence name for either)
+enclosed in
+\fB[.\fR and \fB.]\fR
+stands for the
+sequence of characters of that collating element.
+The sequence is a single element of the bracket expression's list.
+A bracket expression in a locale that has
+multi-character collating elements
+can thus match more than one character.
+.VS 8.2
+So (insidiously), a bracket expression that starts with \fB^\fR
+can match multi-character collating elements even if none of them
+appear in the bracket expression!
+(\fINote:\fR Tcl currently has no multi-character collating elements.
+This information is only for illustration.)
+.PP
+For example, assume the collating sequence includes a \fBch\fR
+multi-character collating element.
+Then the RE \fB[[.ch.]]*c\fR (zero or more \fBch\fP's followed by \fBc\fP)
+matches the first five characters of `\fBchchcc\fR'.
+Also, the RE \fB[^c]b\fR matches all of `\fBchb\fR'
+(because \fB[^c]\fR matches the multi-character \fBch\fR).
+.VE 8.2
+.PP
+Within a bracket expression, a collating element enclosed in
+\fB[=\fR
+and
+\fB=]\fR
+is an equivalence class, standing for the sequences of characters
+of all collating elements equivalent to that one, including itself.
+(If there are no other equivalent collating elements,
+the treatment is as if the enclosing delimiters were `\fB[.\fR'\&
+and `\fB.]\fR'.)
+For example, if
+\fBo\fR
+and
+\fB\o'o^'\fR
+are the members of an equivalence class,
+then `\fB[[=o=]]\fR', `\fB[[=\o'o^'=]]\fR',
+and `\fB[o\o'o^']\fR'\&
+are all synonymous.
+An equivalence class may not be an endpoint
+of a range.
+.VS 8.2
+(\fINote:\fR
+Tcl currently implements only the Unicode locale.
+It doesn't define any equivalence classes.
+The examples above are just illustrations.)
+.VE 8.2
+.PP
+Within a bracket expression, the name of a \fIcharacter class\fR enclosed
+in
+\fB[:\fR
+and
+\fB:]\fR
+stands for the list of all characters
+(not all collating elements!)
+belonging to that
+class.
+Standard character classes are:
+.PP
+.RS
+.ne 5
+.nf
+.ta 3c
+\fBalpha\fR A letter.
+\fBupper\fR An upper-case letter.
+\fBlower\fR A lower-case letter.
+\fBdigit\fR A decimal digit.
+\fBxdigit\fR A hexadecimal digit.
+\fBalnum\fR An alphanumeric (letter or digit).
+\fBprint\fR An alphanumeric (same as alnum).
+\fBblank\fR A space or tab character.
+\fBspace\fR A character producing white space in displayed text.
+\fBpunct\fR A punctuation character.
+\fBgraph\fR A character with a visible representation.
+\fBcntrl\fR A control character.
+.fi
+.RE
+.PP
+A locale may provide others.
+.VS 8.2
+(Note that the current Tcl implementation has only one locale:
+the Unicode locale.)
+.VE 8.2
+A character class may not be used as an endpoint of a range.
+.PP
+There are two special cases of bracket expressions:
+the bracket expressions
+\fB[[:<:]]\fR
+and
+\fB[[:>:]]\fR
+are constraints, matching empty strings at
+the beginning and end of a word respectively.
+'\" note, discussion of escapes below references this definition of word
+A word is defined as a sequence of
+word characters
+that is neither preceded nor followed by
+word characters.
+A word character is an
+\fIalnum\fR
+character
+or an underscore
+(\fB_\fR).
+These special bracket expressions are deprecated;
+users of AREs should use constraint escapes instead (see below).
+.SH ESCAPES
+Escapes (AREs only), which begin with a
+\fB\e\fR
+followed by an alphanumeric character,
+come in several varieties:
+character entry, class shorthands, constraint escapes, and back references.
+A
+\fB\e\fR
+followed by an alphanumeric character but not constituting
+a valid escape is illegal in AREs.
+In EREs, there are no escapes:
+outside a bracket expression,
+a
+\fB\e\fR
+followed by an alphanumeric character merely stands for that
+character as an ordinary character,
+and inside a bracket expression,
+\fB\e\fR
+is an ordinary character.
+(The latter is the one actual incompatibility between EREs and AREs.)
+.PP
+Character-entry escapes (AREs only) exist to make it easier to specify
+non-printing and otherwise inconvenient characters in REs:
+.RS 2
+.TP 5
+\fB\ea\fR
+alert (bell) character, as in C
+.TP
+\fB\eb\fR
+backspace, as in C
+.TP
+\fB\eB\fR
+synonym for
+\fB\e\fR
+to help reduce backslash doubling in some
+applications where there are multiple levels of backslash processing
+.TP
+\fB\ec\fIX\fR
+(where X is any character) the character whose
+low-order 5 bits are the same as those of
+\fIX\fR,
+and whose other bits are all zero
+.TP
+\fB\ee\fR
+the character whose collating-sequence name
+is `\fBESC\fR',
+or failing that, the character with octal value 033
+.TP
+\fB\ef\fR
+formfeed, as in C
+.TP
+\fB\en\fR
+newline, as in C
+.TP
+\fB\er\fR
+carriage return, as in C
+.TP
+\fB\et\fR
+horizontal tab, as in C
+.TP
+\fB\eu\fIwxyz\fR
+(where
+\fIwxyz\fR
+is exactly four hexadecimal digits)
+the Unicode character
+\fBU+\fIwxyz\fR
+in the local byte ordering
+.TP
+\fB\eU\fIstuvwxyz\fR
+(where
+\fIstuvwxyz\fR
+is exactly eight hexadecimal digits)
+reserved for a somewhat-hypothetical Unicode extension to 32 bits
+.TP
+\fB\ev\fR
+vertical tab, as in C
+are all available.
+.TP
+\fB\ex\fIhhh\fR
+(where
+\fIhhh\fR
+is any sequence of hexadecimal digits)
+the character whose hexadecimal value is
+\fB0x\fIhhh\fR
+(a single character no matter how many hexadecimal digits are used).
+.TP
+\fB\e0\fR
+the character whose value is
+\fB0\fR
+.TP
+\fB\e\fIxy\fR
+(where
+\fIxy\fR
+is exactly two octal digits,
+and is not a
+\fIback reference\fR (see below))
+the character whose octal value is
+\fB0\fIxy\fR
+.TP
+\fB\e\fIxyz\fR
+(where
+\fIxyz\fR
+is exactly three octal digits,
+and is not a
+back reference (see below))
+the character whose octal value is
+\fB0\fIxyz\fR
+.RE
+.PP
+Hexadecimal digits are `\fB0\fR'-`\fB9\fR', `\fBa\fR'-`\fBf\fR',
+and `\fBA\fR'-`\fBF\fR'.
+Octal digits are `\fB0\fR'-`\fB7\fR'.
+.PP
+The character-entry escapes are always taken as ordinary characters.
+For example,
+\fB\e135\fR
+is
+\fB]\fR
+in ASCII,
+but
+\fB\e135\fR
+does not terminate a bracket expression.
+Beware, however, that some applications (e.g., C compilers) interpret
+such sequences themselves before the regular-expression package
+gets to see them, which may require doubling (quadrupling, etc.) the `\fB\e\fR'.
+.PP
+Class-shorthand escapes (AREs only) provide shorthands for certain commonly-used
+character classes:
+.RS 2
+.TP 10
+\fB\ed\fR
+\fB[[:digit:]]\fR
+.TP
+\fB\es\fR
+\fB[[:space:]]\fR
+.TP
+\fB\ew\fR
+\fB[[:alnum:]_]\fR
+(note underscore)
+.TP
+\fB\eD\fR
+\fB[^[:digit:]]\fR
+.TP
+\fB\eS\fR
+\fB[^[:space:]]\fR
+.TP
+\fB\eW\fR
+\fB[^[:alnum:]_]\fR
+(note underscore)
+.RE
+.PP
+Within bracket expressions, `\fB\ed\fR', `\fB\es\fR',
+and `\fB\ew\fR'\&
+lose their outer brackets,
+and `\fB\eD\fR', `\fB\eS\fR',
+and `\fB\eW\fR'\&
+are illegal.
+.VS 8.2
+(So, for example, \fB[a-c\ed]\fR is equivalent to \fB[a-c[:digit:]]\fR.
+Also, \fB[a-c\eD]\fR, which is equivalent to \fB[a-c^[:digit:]]\fR, is illegal.)
+.VE 8.2
+.PP
+A constraint escape (AREs only) is a constraint,
+matching the empty string if specific conditions are met,
+written as an escape:
+.RS 2
+.TP 6
+\fB\eA\fR
+matches only at the beginning of the string
+(see MATCHING, below, for how this differs from `\fB^\fR')
+.TP
+\fB\em\fR
+matches only at the beginning of a word
+.TP
+\fB\eM\fR
+matches only at the end of a word
+.TP
+\fB\ey\fR
+matches only at the beginning or end of a word
+.TP
+\fB\eY\fR
+matches only at a point that is not the beginning or end of a word
+.TP
+\fB\eZ\fR
+matches only at the end of the string
+(see MATCHING, below, for how this differs from `\fB$\fR')
+.TP
+\fB\e\fIm\fR
+(where
+\fIm\fR
+is a nonzero digit) a \fIback reference\fR, see below
+.TP
+\fB\e\fImnn\fR
+(where
+\fIm\fR
+is a nonzero digit, and
+\fInn\fR
+is some more digits,
+and the decimal value
+\fImnn\fR
+is not greater than the number of closing capturing parentheses seen so far)
+a \fIback reference\fR, see below
+.RE
+.PP
+A word is defined as in the specification of
+\fB[[:<:]]\fR
+and
+\fB[[:>:]]\fR
+above.
+Constraint escapes are illegal within bracket expressions.
+.PP
+A back reference (AREs only) matches the same string matched by the parenthesized
+subexpression specified by the number,
+so that (e.g.)
+\fB([bc])\e1\fR
+matches
+\fBbb\fR
+or
+\fBcc\fR
+but not `\fBbc\fR'.
+The subexpression must entirely precede the back reference in the RE.
+Subexpressions are numbered in the order of their leading parentheses.
+Non-capturing parentheses do not define subexpressions.
+.PP
+There is an inherent historical ambiguity between octal character-entry
+escapes and back references, which is resolved by heuristics,
+as hinted at above.
+A leading zero always indicates an octal escape.
+A single non-zero digit, not followed by another digit,
+is always taken as a back reference.
+A multi-digit sequence not starting with a zero is taken as a back
+reference if it comes after a suitable subexpression
+(i.e. the number is in the legal range for a back reference),
+and otherwise is taken as octal.
+.SH "METASYNTAX"
+In addition to the main syntax described above, there are some special
+forms and miscellaneous syntactic facilities available.
+.PP
+Normally the flavor of RE being used is specified by
+application-dependent means.
+However, this can be overridden by a \fIdirector\fR.
+If an RE of any flavor begins with `\fB***:\fR',
+the rest of the RE is an ARE.
+If an RE of any flavor begins with `\fB***=\fR',
+the rest of the RE is taken to be a literal string,
+with all characters considered ordinary characters.
+.PP
+An ARE may begin with \fIembedded options\fR:
+a sequence
+\fB(?\fIxyz\fB)\fR
+(where
+\fIxyz\fR
+is one or more alphabetic characters)
+specifies options affecting the rest of the RE.
+These supplement, and can override,
+any options specified by the application.
+The available option letters are:
+.RS 2
+.TP 3
+\fBb\fR
+rest of RE is a BRE
+.TP 3
+\fBc\fR
+case-sensitive matching (usual default)
+.TP 3
+\fBe\fR
+rest of RE is an ERE
+.TP 3
+\fBi\fR
+case-insensitive matching (see MATCHING, below)
+.TP 3
+\fBm\fR
+historical synonym for
+\fBn\fR
+.TP 3
+\fBn\fR
+newline-sensitive matching (see MATCHING, below)
+.TP 3
+\fBp\fR
+partial newline-sensitive matching (see MATCHING, below)
+.TP 3
+\fBq\fR
+rest of RE is a literal (``quoted'') string, all ordinary characters
+.TP 3
+\fBs\fR
+non-newline-sensitive matching (usual default)
+.TP 3
+\fBt\fR
+tight syntax (usual default; see below)
+.TP 3
+\fBw\fR
+inverse partial newline-sensitive (``weird'') matching (see MATCHING, below)
+.TP 3
+\fBx\fR
+expanded syntax (see below)
+.RE
+.PP
+Embedded options take effect at the
+\fB)\fR
+terminating the sequence.
+They are available only at the start of an ARE,
+and may not be used later within it.
+.PP
+In addition to the usual (\fItight\fR) RE syntax, in which all characters are
+significant, there is an \fIexpanded\fR syntax,
+available in all flavors of RE
+with the \fB-expanded\fR switch, or in AREs with the embedded x option.
+In the expanded syntax,
+white-space characters are ignored
+and all characters between a
+\fB#\fR
+and the following newline (or the end of the RE) are ignored,
+permitting paragraphing and commenting a complex RE.
+There are three exceptions to that basic rule:
+.RS 2
+.PP
+a white-space character or `\fB#\fR' preceded by `\fB\e\fR' is retained
+.PP
+white space or `\fB#\fR' within a bracket expression is retained
+.PP
+white space and comments are illegal within multi-character symbols
+like the ARE `\fB(?:\fR' or the BRE `\fB\e(\fR'
+.RE
+.PP
+Expanded-syntax white-space characters are blank, tab, newline, and
+.VS 8.2
+any character that belongs to the \fIspace\fR character class.
+.VE 8.2
+.PP
+Finally, in an ARE,
+outside bracket expressions, the sequence `\fB(?#\fIttt\fB)\fR'
+(where
+\fIttt\fR
+is any text not containing a `\fB)\fR')
+is a comment,
+completely ignored.
+Again, this is not allowed between the characters of
+multi-character symbols like `\fB(?:\fR'.
+Such comments are more a historical artifact than a useful facility,
+and their use is deprecated;
+use the expanded syntax instead.
+.PP
+\fINone\fR of these metasyntax extensions is available if the application
+(or an initial
+\fB***=\fR
+director)
+has specified that the user's input be treated as a literal string
+rather than as an RE.
+.SH MATCHING
+In the event that an RE could match more than one substring of a given
+string,
+the RE matches the one starting earliest in the string.
+If the RE could match more than one substring starting at that point,
+its choice is determined by its \fIpreference\fR:
+either the longest substring, or the shortest.
+.PP
+Most atoms, and all constraints, have no preference.
+A parenthesized RE has the same preference (possibly none) as the RE.
+A quantified atom with quantifier
+\fB{\fIm\fB}\fR
+or
+\fB{\fIm\fB}?\fR
+has the same preference (possibly none) as the atom itself.
+A quantified atom with other normal quantifiers (including
+\fB{\fIm\fB,\fIn\fB}\fR
+with
+\fIm\fR
+equal to
+\fIn\fR)
+prefers longest match.
+A quantified atom with other non-greedy quantifiers (including
+\fB{\fIm\fB,\fIn\fB}?\fR
+with
+\fIm\fR
+equal to
+\fIn\fR)
+prefers shortest match.
+A branch has the same preference as the first quantified atom in it
+which has a preference.
+An RE consisting of two or more branches connected by the
+\fB|\fR
+operator prefers longest match.
+.PP
+Subject to the constraints imposed by the rules for matching the whole RE,
+subexpressions also match the longest or shortest possible substrings,
+based on their preferences,
+with subexpressions starting earlier in the RE taking priority over
+ones starting later.
+Note that outer subexpressions thus take priority over
+their component subexpressions.
+.PP
+Note that the quantifiers
+\fB{1,1}\fR
+and
+\fB{1,1}?\fR
+can be used to force longest and shortest preference, respectively,
+on a subexpression or a whole RE.
+.PP
+Match lengths are measured in characters, not collating elements.
+An empty string is considered longer than no match at all.
+For example,
+\fBbb*\fR
+matches the three middle characters of `\fBabbbc\fR',
+\fB(week|wee)(night|knights)\fR
+matches all ten characters of `\fBweeknights\fR',
+when
+\fB(.*).*\fR
+is matched against
+\fBabc\fR
+the parenthesized subexpression
+matches all three characters, and
+when
+\fB(a*)*\fR
+is matched against
+\fBbc\fR
+both the whole RE and the parenthesized
+subexpression match an empty string.
+.PP
+If case-independent matching is specified,
+the effect is much as if all case distinctions had vanished from the
+alphabet.
+When an alphabetic that exists in multiple cases appears as an
+ordinary character outside a bracket expression, it is effectively
+transformed into a bracket expression containing both cases,
+so that
+\fBx\fR
+becomes `\fB[xX]\fR'.
+When it appears inside a bracket expression, all case counterparts
+of it are added to the bracket expression, so that
+\fB[x]\fR
+becomes
+\fB[xX]\fR
+and
+\fB[^x]\fR
+becomes `\fB[^xX]\fR'.
+.PP
+If newline-sensitive matching is specified, \fB.\fR
+and bracket expressions using
+\fB^\fR
+will never match the newline character
+(so that matches will never cross newlines unless the RE
+explicitly arranges it)
+and
+\fB^\fR
+and
+\fB$\fR
+will match the empty string after and before a newline
+respectively, in addition to matching at beginning and end of string
+respectively.
+ARE
+\fB\eA\fR
+and
+\fB\eZ\fR
+continue to match beginning or end of string \fIonly\fR.
+.PP
+If partial newline-sensitive matching is specified,
+this affects \fB.\fR
+and bracket expressions
+as with newline-sensitive matching, but not
+\fB^\fR
+and `\fB$\fR'.
+.PP
+If inverse partial newline-sensitive matching is specified,
+this affects
+\fB^\fR
+and
+\fB$\fR
+as with
+newline-sensitive matching,
+but not \fB.\fR
+and bracket expressions.
+This isn't very useful but is provided for symmetry.
+.SH "LIMITS AND COMPATIBILITY"
+No particular limit is imposed on the length of REs.
+Programs intended to be highly portable should not employ REs longer
+than 256 bytes,
+as a POSIX-compliant implementation can refuse to accept such REs.
+.PP
+The only feature of AREs that is actually incompatible with
+POSIX EREs is that
+\fB\e\fR
+does not lose its special
+significance inside bracket expressions.
+All other ARE features use syntax which is illegal or has
+undefined or unspecified effects in POSIX EREs;
+the
+\fB***\fR
+syntax of directors likewise is outside the POSIX
+syntax for both BREs and EREs.
+.PP
+Many of the ARE extensions are borrowed from Perl, but some have
+been changed to clean them up, and a few Perl extensions are not present.
+Incompatibilities of note include `\fB\eb\fR', `\fB\eB\fR',
+the lack of special treatment for a trailing newline,
+the addition of complemented bracket expressions to the things
+affected by newline-sensitive matching,
+the restrictions on parentheses and back references in lookahead constraints,
+and the longest/shortest-match (rather than first-match) matching semantics.
+.PP
+The matching rules for REs containing both normal and non-greedy quantifiers
+have changed since early beta-test versions of this package.
+(The new rules are much simpler and cleaner,
+but don't work as hard at guessing the user's real intentions.)
+.PP
+Henry Spencer's original 1986 \fIregexp\fR package,
+still in widespread use (e.g., in pre-8.1 releases of Tcl),
+implemented an early version of today's EREs.
+There are four incompatibilities between \fIregexp\fR's near-EREs
+(`RREs' for short) and AREs.
+In roughly increasing order of significance:
+.PP
+.RS
+In AREs,
+\fB\e\fR
+followed by an alphanumeric character is either an
+escape or an error,
+while in RREs, it was just another way of writing the
+alphanumeric.
+This should not be a problem because there was no reason to write
+such a sequence in RREs.
+.PP
+\fB{\fR
+followed by a digit in an ARE is the beginning of a bound,
+while in RREs,
+\fB{\fR
+was always an ordinary character.
+Such sequences should be rare,
+and will often result in an error because following characters
+will not look like a valid bound.
+.PP
+In AREs,
+\fB\e\fR
+remains a special character within `\fB[\|]\fR',
+so a literal
+\fB\e\fR
+within
+\fB[\|]\fR
+must be written `\fB\e\e\fR'.
+\fB\e\e\fR
+also gives a literal
+\fB\e\fR
+within
+\fB[\|]\fR
+in RREs,
+but only truly paranoid programmers routinely doubled the backslash.
+.PP
+AREs report the longest/shortest match for the RE,
+rather than the first found in a specified search order.
+This may affect some RREs which were written in the expectation that
+the first match would be reported.
+(The careful crafting of RREs to optimize the search order for fast
+matching is obsolete (AREs examine all possible matches
+in parallel, and their performance is largely insensitive to their
+complexity) but cases where the search order was exploited to deliberately
+find a match which was \fInot\fR the longest/shortest will need rewriting.)
+.RE
+
+.SH "BASIC REGULAR EXPRESSIONS"
+BREs differ from EREs in several respects. `\fB|\fR', `\fB+\fR',
+and
+\fB?\fR
+are ordinary characters and there is no equivalent
+for their functionality.
+The delimiters for bounds are
+\fB\e{\fR
+and `\fB\e}\fR',
+with
+\fB{\fR
+and
+\fB}\fR
+by themselves ordinary characters.
+The parentheses for nested subexpressions are
+\fB\e(\fR
+and `\fB\e)\fR',
+with
+\fB(\fR
+and
+\fB)\fR
+by themselves ordinary characters.
+\fB^\fR
+is an ordinary character except at the beginning of the
+RE or the beginning of a parenthesized subexpression,
+\fB$\fR
+is an ordinary character except at the end of the
+RE or the end of a parenthesized subexpression,
+and
+\fB*\fR
+is an ordinary character if it appears at the beginning of the
+RE or the beginning of a parenthesized subexpression
+(after a possible leading `\fB^\fR').
+Finally,
+single-digit back references are available,
+and
+\fB\e<\fR
+and
+\fB\e>\fR
+are synonyms for
+\fB[[:<:]]\fR
+and
+\fB[[:>:]]\fR
+respectively;
+no other escapes are available.
+
+.SH "SEE ALSO"
+RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)
+
+.SH KEYWORDS
+match, regular expression, string
--- /dev/null
+/*
+ * colorings of characters
+ * This file is #included by regcomp.c.
+ *
+ * Copyright (c) 1998, 1999 Henry Spencer. All rights reserved.
+ *
+ * Development of this software was funded, in part, by Cray Research Inc.,
+ * UUNET Communications Services Inc., Sun Microsystems Inc., and Scriptics
+ * Corporation, none of whom are responsible for the results. The author
+ * thanks all of them.
+ *
+ * Redistribution and use in source and binary forms -- with or without
+ * modification -- are permitted for any purpose, provided that
+ * redistributions in source form retain this entire copyright notice and
+ * indicate the origin and nature of any modifications.
+ *
+ * I'd appreciate being given credit for this package in the documentation
+ * of software which uses it, but that is not a requirement.
+ *
+ * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * HENRY SPENCER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
+ * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+ * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
+ * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
+ * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * $Header: /cvsroot/pgsql/src/backend/regex/regc_color.c,v 1.1 2003/02/05 17:41:32 tgl Exp $
+ *
+ *
+ * Note that there are some incestuous relationships between this code and
+ * NFA arc maintenance, which perhaps ought to be cleaned up sometime.
+ */
+
+
+
+#define CISERR() VISERR(cm->v)
+#define CERR(e) VERR(cm->v, (e))
+
+
+
+/*
+ * initcm - set up new colormap
+ */
+static void
+initcm(struct vars *v,
+ struct colormap *cm)
+{
+ int i;
+ int j;
+ union tree *t;
+ union tree *nextt;
+ struct colordesc *cd;
+
+ cm->magic = CMMAGIC;
+ cm->v = v;
+
+ cm->ncds = NINLINECDS;
+ cm->cd = cm->cdspace;
+ cm->max = 0;
+ cm->free = 0;
+
+ cd = cm->cd; /* cm->cd[WHITE] */
+ cd->sub = NOSUB;
+ cd->arcs = NULL;
+ cd->flags = 0;
+ cd->nchrs = CHR_MAX - CHR_MIN + 1;
+
+ /* upper levels of tree */
+ for (t = &cm->tree[0], j = NBYTS-1; j > 0; t = nextt, j--) {
+ nextt = t + 1;
+ for (i = BYTTAB-1; i >= 0; i--)
+ t->tptr[i] = nextt;
+ }
+ /* bottom level is solid white */
+ t = &cm->tree[NBYTS-1];
+ for (i = BYTTAB-1; i >= 0; i--)
+ t->tcolor[i] = WHITE;
+ cd->block = t;
+}
+
+/*
+ * freecm - free dynamically-allocated things in a colormap
+ */
+static void
+freecm(struct colormap *cm)
+{
+ size_t i;
+ union tree *cb;
+
+ cm->magic = 0;
+ if (NBYTS > 1)
+ cmtreefree(cm, cm->tree, 0);
+ for (i = 1; i <= cm->max; i++) /* skip WHITE */
+ if (!UNUSEDCOLOR(&cm->cd[i])) {
+ cb = cm->cd[i].block;
+ if (cb != NULL)
+ FREE(cb);
+ }
+ if (cm->cd != cm->cdspace)
+ FREE(cm->cd);
+}
+
+/*
+ * cmtreefree - free a non-terminal part of a colormap tree
+ */
+static void
+cmtreefree(struct colormap *cm,
+ union tree *tree,
+ int level) /* level number (top == 0) of this block */
+{
+ int i;
+ union tree *t;
+ union tree *fillt = &cm->tree[level+1];
+ union tree *cb;
+
+ assert(level < NBYTS-1); /* this level has pointers */
+ for (i = BYTTAB-1; i >= 0; i--) {
+ t = tree->tptr[i];
+ assert(t != NULL);
+ if (t != fillt) {
+ if (level < NBYTS-2) { /* more pointer blocks below */
+ cmtreefree(cm, t, level+1);
+ FREE(t);
+ } else { /* color block below */
+ cb = cm->cd[t->tcolor[0]].block;
+ if (t != cb) /* not a solid block */
+ FREE(t);
+ }
+ }
+ }
+}
+
+/*
+ * setcolor - set the color of a character in a colormap
+ */
+static color /* previous color */
+setcolor(struct colormap *cm,
+ chr c,
+ pcolor co)
+{
+ uchr uc = c;
+ int shift;
+ int level;
+ int b;
+ int bottom;
+ union tree *t;
+ union tree *newt;
+ union tree *fillt;
+ union tree *lastt;
+ union tree *cb;
+ color prev;
+
+ assert(cm->magic == CMMAGIC);
+ if (CISERR() || co == COLORLESS)
+ return COLORLESS;
+
+ t = cm->tree;
+ for (level = 0, shift = BYTBITS * (NBYTS - 1); shift > 0;
+ level++, shift -= BYTBITS) {
+ b = (uc >> shift) & BYTMASK;
+ lastt = t;
+ t = lastt->tptr[b];
+ assert(t != NULL);
+ fillt = &cm->tree[level+1];
+ bottom = (shift <= BYTBITS) ? 1 : 0;
+ cb = (bottom) ? cm->cd[t->tcolor[0]].block : fillt;
+ if (t == fillt || t == cb) { /* must allocate a new block */
+ newt = (union tree *)MALLOC((bottom) ?
+ sizeof(struct colors) : sizeof(struct ptrs));
+ if (newt == NULL) {
+ CERR(REG_ESPACE);
+ return COLORLESS;
+ }
+ if (bottom)
+ memcpy(VS(newt->tcolor), VS(t->tcolor),
+ BYTTAB*sizeof(color));
+ else
+ memcpy(VS(newt->tptr), VS(t->tptr),
+ BYTTAB*sizeof(union tree *));
+ t = newt;
+ lastt->tptr[b] = t;
+ }
+ }
+
+ b = uc & BYTMASK;
+ prev = t->tcolor[b];
+ t->tcolor[b] = (color)co;
+ return prev;
+}
+
+/*
+ * maxcolor - report largest color number in use
+ */
+static color
+maxcolor(struct colormap *cm)
+{
+ if (CISERR())
+ return COLORLESS;
+
+ return (color)cm->max;
+}
+
+/*
+ * newcolor - find a new color (must be subject of setcolor at once)
+ * Beware: may relocate the colordescs.
+ */
+static color /* COLORLESS for error */
+newcolor(struct colormap *cm)
+{
+ struct colordesc *cd;
+ struct colordesc *new;
+ size_t n;
+
+ if (CISERR())
+ return COLORLESS;
+
+ if (cm->free != 0) {
+ assert(cm->free > 0);
+ assert((size_t)cm->free < cm->ncds);
+ cd = &cm->cd[cm->free];
+ assert(UNUSEDCOLOR(cd));
+ assert(cd->arcs == NULL);
+ cm->free = cd->sub;
+ } else if (cm->max < cm->ncds - 1) {
+ cm->max++;
+ cd = &cm->cd[cm->max];
+ } else {
+ /* oops, must allocate more */
+ n = cm->ncds * 2;
+ if (cm->cd == cm->cdspace) {
+ new = (struct colordesc *)MALLOC(n *
+ sizeof(struct colordesc));
+ if (new != NULL)
+ memcpy(VS(new), VS(cm->cdspace), cm->ncds *
+ sizeof(struct colordesc));
+ } else
+ new = (struct colordesc *)REALLOC(cm->cd,
+ n * sizeof(struct colordesc));
+ if (new == NULL) {
+ CERR(REG_ESPACE);
+ return COLORLESS;
+ }
+ cm->cd = new;
+ cm->ncds = n;
+ assert(cm->max < cm->ncds - 1);
+ cm->max++;
+ cd = &cm->cd[cm->max];
+ }
+
+ cd->nchrs = 0;
+ cd->sub = NOSUB;
+ cd->arcs = NULL;
+ cd->flags = 0;
+ cd->block = NULL;
+
+ return (color)(cd - cm->cd);
+}
+
+/*
+ * freecolor - free a color (must have no arcs or subcolor)
+ */
+static void
+freecolor(struct colormap *cm,
+ pcolor co)
+{
+ struct colordesc *cd = &cm->cd[co];
+ color pco, nco; /* for freelist scan */
+
+ assert(co >= 0);
+ if (co == WHITE)
+ return;
+
+ assert(cd->arcs == NULL);
+ assert(cd->sub == NOSUB);
+ assert(cd->nchrs == 0);
+ cd->flags = FREECOL;
+ if (cd->block != NULL) {
+ FREE(cd->block);
+ cd->block = NULL; /* just paranoia */
+ }
+
+ if ((size_t)co == cm->max) {
+ while (cm->max > WHITE && UNUSEDCOLOR(&cm->cd[cm->max]))
+ cm->max--;
+ assert(cm->free >= 0);
+ while ((size_t)cm->free > cm->max)
+ cm->free = cm->cd[cm->free].sub;
+ if (cm->free > 0) {
+ assert(cm->free < cm->max);
+ pco = cm->free;
+ nco = cm->cd[pco].sub;
+ while (nco > 0)
+ if ((size_t)nco > cm->max) {
+ /* take this one out of freelist */
+ nco = cm->cd[nco].sub;
+ cm->cd[pco].sub = nco;
+ } else {
+ assert(nco < cm->max);
+ pco = nco;
+ nco = cm->cd[pco].sub;
+ }
+ }
+ } else {
+ cd->sub = cm->free;
+ cm->free = (color)(cd - cm->cd);
+ }
+}
+
+/*
+ * pseudocolor - allocate a false color, to be managed by other means
+ */
+static color
+pseudocolor(struct colormap *cm)
+{
+ color co;
+
+ co = newcolor(cm);
+ if (CISERR())
+ return COLORLESS;
+ cm->cd[co].nchrs = 1;
+ cm->cd[co].flags = PSEUDO;
+ return co;
+}
+
+/*
+ * subcolor - allocate a new subcolor (if necessary) to this chr
+ */
+static color
+subcolor(struct colormap *cm, chr c)
+{
+ color co; /* current color of c */
+ color sco; /* new subcolor */
+
+ co = GETCOLOR(cm, c);
+ sco = newsub(cm, co);
+ if (CISERR())
+ return COLORLESS;
+ assert(sco != COLORLESS);
+
+ if (co == sco) /* already in an open subcolor */
+ return co; /* rest is redundant */
+ cm->cd[co].nchrs--;
+ cm->cd[sco].nchrs++;
+ setcolor(cm, c, sco);
+ return sco;
+}
+
+/*
+ * newsub - allocate a new subcolor (if necessary) for a color
+ */
+static color
+newsub(struct colormap *cm,
+ pcolor co)
+{
+ color sco; /* new subcolor */
+
+ sco = cm->cd[co].sub;
+ if (sco == NOSUB) { /* color has no open subcolor */
+ if (cm->cd[co].nchrs == 1) /* optimization */
+ return co;
+ sco = newcolor(cm); /* must create subcolor */
+ if (sco == COLORLESS) {
+ assert(CISERR());
+ return COLORLESS;
+ }
+ cm->cd[co].sub = sco;
+ cm->cd[sco].sub = sco; /* open subcolor points to self */
+ }
+ assert(sco != NOSUB);
+
+ return sco;
+}
+
+/*
+ * subrange - allocate new subcolors to this range of chrs, fill in arcs
+ */
+static void
+subrange(struct vars *v,
+ chr from,
+ chr to,
+ struct state *lp,
+ struct state *rp)
+{
+ uchr uf;
+ int i;
+
+ assert(from <= to);
+
+ /* first, align "from" on a tree-block boundary */
+ uf = (uchr)from;
+ i = (int)( ((uf + BYTTAB-1) & (uchr)~BYTMASK) - uf );
+ for (; from <= to && i > 0; i--, from++)
+ newarc(v->nfa, PLAIN, subcolor(v->cm, from), lp, rp);
+ if (from > to) /* didn't reach a boundary */
+ return;
+
+ /* deal with whole blocks */
+ for (; to - from >= BYTTAB; from += BYTTAB)
+ subblock(v, from, lp, rp);
+
+ /* clean up any remaining partial table */
+ for (; from <= to; from++)
+ newarc(v->nfa, PLAIN, subcolor(v->cm, from), lp, rp);
+}
+
+/*
+ * subblock - allocate new subcolors for one tree block of chrs, fill in arcs
+ */
+static void
+subblock(struct vars *v,
+ chr start, /* first of BYTTAB chrs */
+ struct state *lp,
+ struct state *rp)
+{
+ uchr uc = start;
+ struct colormap *cm = v->cm;
+ int shift;
+ int level;
+ int i;
+ int b;
+ union tree *t;
+ union tree *cb;
+ union tree *fillt;
+ union tree *lastt;
+ int previ;
+ int ndone;
+ color co;
+ color sco;
+
+ assert((uc % BYTTAB) == 0);
+
+ /* find its color block, making new pointer blocks as needed */
+ t = cm->tree;
+ fillt = NULL;
+ for (level = 0, shift = BYTBITS * (NBYTS - 1); shift > 0;
+ level++, shift -= BYTBITS) {
+ b = (uc >> shift) & BYTMASK;
+ lastt = t;
+ t = lastt->tptr[b];
+ assert(t != NULL);
+ fillt = &cm->tree[level+1];
+ if (t == fillt && shift > BYTBITS) { /* need new ptr block */
+ t = (union tree *)MALLOC(sizeof(struct ptrs));
+ if (t == NULL) {
+ CERR(REG_ESPACE);
+ return;
+ }
+ memcpy(VS(t->tptr), VS(fillt->tptr),
+ BYTTAB*sizeof(union tree *));
+ lastt->tptr[b] = t;
+ }
+ }
+
+ /* special cases: fill block or solid block */
+ co = t->tcolor[0];
+ cb = cm->cd[co].block;
+ if (t == fillt || t == cb) {
+ /* either way, we want a subcolor solid block */
+ sco = newsub(cm, co);
+ t = cm->cd[sco].block;
+ if (t == NULL) { /* must set it up */
+ t = (union tree *)MALLOC(sizeof(struct colors));
+ if (t == NULL) {
+ CERR(REG_ESPACE);
+ return;
+ }
+ for (i = 0; i < BYTTAB; i++)
+ t->tcolor[i] = sco;
+ cm->cd[sco].block = t;
+ }
+ /* find loop must have run at least once */
+ lastt->tptr[b] = t;
+ newarc(v->nfa, PLAIN, sco, lp, rp);
+ cm->cd[co].nchrs -= BYTTAB;
+ cm->cd[sco].nchrs += BYTTAB;
+ return;
+ }
+
+ /* general case, a mixed block to be altered */
+ i = 0;
+ while (i < BYTTAB) {
+ co = t->tcolor[i];
+ sco = newsub(cm, co);
+ newarc(v->nfa, PLAIN, sco, lp, rp);
+ previ = i;
+ do {
+ t->tcolor[i++] = sco;
+ } while (i < BYTTAB && t->tcolor[i] == co);
+ ndone = i - previ;
+ cm->cd[co].nchrs -= ndone;
+ cm->cd[sco].nchrs += ndone;
+ }
+}
+
+/*
+ * okcolors - promote subcolors to full colors
+ */
+static void
+okcolors(struct nfa *nfa,
+ struct colormap *cm)
+{
+ struct colordesc *cd;
+ struct colordesc *end = CDEND(cm);
+ struct colordesc *scd;
+ struct arc *a;
+ color co;
+ color sco;
+
+ for (cd = cm->cd, co = 0; cd < end; cd++, co++) {
+ sco = cd->sub;
+ if (UNUSEDCOLOR(cd) || sco == NOSUB) {
+ /* has no subcolor, no further action */
+ } else if (sco == co) {
+ /* is subcolor, let parent deal with it */
+ } else if (cd->nchrs == 0) {
+ /* parent empty, its arcs change color to subcolor */
+ cd->sub = NOSUB;
+ scd = &cm->cd[sco];
+ assert(scd->nchrs > 0);
+ assert(scd->sub == sco);
+ scd->sub = NOSUB;
+ while ((a = cd->arcs) != NULL) {
+ assert(a->co == co);
+ /* uncolorchain(cm, a); */
+ cd->arcs = a->colorchain;
+ a->co = sco;
+ /* colorchain(cm, a); */
+ a->colorchain = scd->arcs;
+ scd->arcs = a;
+ }
+ freecolor(cm, co);
+ } else {
+ /* parent's arcs must gain parallel subcolor arcs */
+ cd->sub = NOSUB;
+ scd = &cm->cd[sco];
+ assert(scd->nchrs > 0);
+ assert(scd->sub == sco);
+ scd->sub = NOSUB;
+ for (a = cd->arcs; a != NULL; a = a->colorchain) {
+ assert(a->co == co);
+ newarc(nfa, a->type, sco, a->from, a->to);
+ }
+ }
+ }
+}
+
+/*
+ * colorchain - add this arc to the color chain of its color
+ */
+static void
+colorchain(struct colormap *cm,
+ struct arc *a)
+{
+ struct colordesc *cd = &cm->cd[a->co];
+
+ a->colorchain = cd->arcs;
+ cd->arcs = a;
+}
+
+/*
+ * uncolorchain - delete this arc from the color chain of its color
+ */
+static void
+uncolorchain(struct colormap *cm,
+ struct arc *a)
+{
+ struct colordesc *cd = &cm->cd[a->co];
+ struct arc *aa;
+
+ aa = cd->arcs;
+ if (aa == a) /* easy case */
+ cd->arcs = a->colorchain;
+ else {
+ for (; aa != NULL && aa->colorchain != a; aa = aa->colorchain)
+ continue;
+ assert(aa != NULL);
+ aa->colorchain = a->colorchain;
+ }
+ a->colorchain = NULL; /* paranoia */
+}
+
+/*
+ * singleton - is this character in its own color?
+ */
+static int /* predicate */
+singleton(struct colormap *cm,
+ chr c)
+{
+ color co; /* color of c */
+
+ co = GETCOLOR(cm, c);
+ if (cm->cd[co].nchrs == 1 && cm->cd[co].sub == NOSUB)
+ return 1;
+ return 0;
+}
+
+/*
+ * rainbow - add arcs of all full colors (but one) between specified states
+ */
+static void
+rainbow(struct nfa *nfa,
+ struct colormap *cm,
+ int type,
+ pcolor but, /* COLORLESS if no exceptions */
+ struct state *from,
+ struct state *to)
+{
+ struct colordesc *cd;
+ struct colordesc *end = CDEND(cm);
+ color co;
+
+ for (cd = cm->cd, co = 0; cd < end && !CISERR(); cd++, co++)
+ if (!UNUSEDCOLOR(cd) && cd->sub != co && co != but &&
+ !(cd->flags&PSEUDO))
+ newarc(nfa, type, co, from, to);
+}
+
+/*
+ * colorcomplement - add arcs of complementary colors
+ *
+ * The calling sequence ought to be reconciled with cloneouts().
+ */
+static void
+colorcomplement(struct nfa *nfa,
+ struct colormap *cm,
+ int type,
+ struct state *of, /* complements of this guy's PLAIN outarcs */
+ struct state *from,
+ struct state *to)
+{
+ struct colordesc *cd;
+ struct colordesc *end = CDEND(cm);
+ color co;
+
+ assert(of != from);
+ for (cd = cm->cd, co = 0; cd < end && !CISERR(); cd++, co++)
+ if (!UNUSEDCOLOR(cd) && !(cd->flags&PSEUDO))
+ if (findarc(of, PLAIN, co) == NULL)
+ newarc(nfa, type, co, from, to);
+}
+
+
+#ifdef REG_DEBUG
+
+/*
+ * dumpcolors - debugging output
+ */
+static void
+dumpcolors(struct colormap *cm,
+ FILE *f)
+{
+ struct colordesc *cd;
+ struct colordesc *end;
+ color co;
+ chr c;
+ char *has;
+
+ fprintf(f, "max %ld\n", (long)cm->max);
+ if (NBYTS > 1)
+ fillcheck(cm, cm->tree, 0, f);
+ end = CDEND(cm);
+ for (cd = cm->cd + 1, co = 1; cd < end; cd++, co++) /* skip 0 */
+ if (!UNUSEDCOLOR(cd)) {
+ assert(cd->nchrs > 0);
+ has = (cd->block != NULL) ? "#" : "";
+ if (cd->flags&PSEUDO)
+ fprintf(f, "#%2ld%s(ps): ", (long)co, has);
+ else
+ fprintf(f, "#%2ld%s(%2d): ", (long)co,
+ has, cd->nchrs);
+ /* it's hard to do this more efficiently */
+ for (c = CHR_MIN; c < CHR_MAX; c++)
+ if (GETCOLOR(cm, c) == co)
+ dumpchr(c, f);
+ assert(c == CHR_MAX);
+ if (GETCOLOR(cm, c) == co)
+ dumpchr(c, f);
+ fprintf(f, "\n");
+ }
+}
+
+/*
+ * fillcheck - check proper filling of a tree
+ */
+static void
+fillcheck(struct colormap *cm,
+ union tree *tree,
+ int level, /* level number (top == 0) of this block */
+ FILE *f)
+{
+ int i;
+ union tree *t;
+ union tree *fillt = &cm->tree[level+1];
+
+ assert(level < NBYTS-1); /* this level has pointers */
+ for (i = BYTTAB-1; i >= 0; i--) {
+ t = tree->tptr[i];
+ if (t == NULL)
+ fprintf(f, "NULL found in filled tree!\n");
+ else if (t == fillt)
+ {}
+ else if (level < NBYTS-2) /* more pointer blocks below */
+ fillcheck(cm, t, level+1, f);
+ }
+}
+
+/*
+ * dumpchr - print a chr
+ *
+ * Kind of char-centric but works well enough for debug use.
+ */
+static void
+dumpchr(chr c,
+ FILE *f)
+{
+ if (c == '\\')
+ fprintf(f, "\\\\");
+ else if (c > ' ' && c <= '~')
+ putc((char)c, f);
+ else
+ fprintf(f, "\\u%04lx", (long)c);
+}
+
+#endif /* REG_DEBUG */
--- /dev/null
+/*
+ * Utility functions for handling cvecs
+ * This file is #included by regcomp.c.
+ *
+ * Copyright (c) 1998, 1999 Henry Spencer. All rights reserved.
+ *
+ * Development of this software was funded, in part, by Cray Research Inc.,
+ * UUNET Communications Services Inc., Sun Microsystems Inc., and Scriptics
+ * Corporation, none of whom are responsible for the results. The author
+ * thanks all of them.
+ *
+ * Redistribution and use in source and binary forms -- with or without
+ * modification -- are permitted for any purpose, provided that
+ * redistributions in source form retain this entire copyright notice and
+ * indicate the origin and nature of any modifications.
+ *
+ * I'd appreciate being given credit for this package in the documentation
+ * of software which uses it, but that is not a requirement.
+ *
+ * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * HENRY SPENCER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
+ * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+ * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
+ * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
+ * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * $Header: /cvsroot/pgsql/src/backend/regex/regc_cvec.c,v 1.1 2003/02/05 17:41:32 tgl Exp $
+ *
+ */
+
+/*
+ * newcvec - allocate a new cvec
+ */
+static struct cvec *
+newcvec(int nchrs, /* to hold this many chrs... */
+ int nranges, /* ... and this many ranges... */
+ int nmcces) /* ... and this many MCCEs */
+{
+ size_t n;
+ size_t nc;
+ struct cvec *cv;
+
+ nc = (size_t)nchrs + (size_t)nmcces*(MAXMCCE+1) + (size_t)nranges*2;
+ n = sizeof(struct cvec) + (size_t)(nmcces-1)*sizeof(chr *)
+ + nc*sizeof(chr);
+ cv = (struct cvec *)MALLOC(n);
+ if (cv == NULL) {
+ return NULL;
+ }
+ cv->chrspace = nchrs;
+ cv->chrs = (chr *)&cv->mcces[nmcces]; /* chrs just after MCCE ptrs */
+ cv->mccespace = nmcces;
+ cv->ranges = cv->chrs + nchrs + nmcces*(MAXMCCE+1);
+ cv->rangespace = nranges;
+ return clearcvec(cv);
+}
+
+/*
+ * clearcvec - clear a possibly-new cvec
+ * Returns pointer as convenience.
+ */
+static struct cvec *
+clearcvec(struct cvec *cv)
+{
+ int i;
+
+ assert(cv != NULL);
+ cv->nchrs = 0;
+ assert(cv->chrs == (chr *)&cv->mcces[cv->mccespace]);
+ cv->nmcces = 0;
+ cv->nmccechrs = 0;
+ cv->nranges = 0;
+ for (i = 0; i < cv->mccespace; i++) {
+ cv->mcces[i] = NULL;
+ }
+
+ return cv;
+}
+
+/*
+ * addchr - add a chr to a cvec
+ */
+static void
+addchr(struct cvec *cv, /* character vector */
+ chr c) /* character to add */
+{
+ assert(cv->nchrs < cv->chrspace - cv->nmccechrs);
+ cv->chrs[cv->nchrs++] = (chr)c;
+}
+
+/*
+ * addrange - add a range to a cvec
+ */
+static void
+addrange(struct cvec *cv, /* character vector */
+ chr from, /* first character of range */
+ chr to) /* last character of range */
+{
+ assert(cv->nranges < cv->rangespace);
+ cv->ranges[cv->nranges*2] = (chr)from;
+ cv->ranges[cv->nranges*2 + 1] = (chr)to;
+ cv->nranges++;
+}
+
+/*
+ * addmcce - add an MCCE to a cvec
+ */
+static void
+addmcce(struct cvec *cv, /* character vector */
+ chr *startp, /* beginning of text */
+ chr *endp) /* just past end of text */
+{
+ int len;
+ int i;
+ chr *s;
+ chr *d;
+
+ if (startp == NULL && endp == NULL) {
+ return;
+ }
+ len = endp - startp;
+ assert(len > 0);
+ assert(cv->nchrs + len < cv->chrspace - cv->nmccechrs);
+ assert(cv->nmcces < cv->mccespace);
+ d = &cv->chrs[cv->chrspace - cv->nmccechrs - len - 1];
+ cv->mcces[cv->nmcces++] = d;
+ for (s = startp, i = len; i > 0; s++, i--) {
+ *d++ = *s;
+ }
+ *d++ = 0; /* endmarker */
+ assert(d == &cv->chrs[cv->chrspace - cv->nmccechrs]);
+ cv->nmccechrs += len + 1;
+}
+
+/*
+ * haschr - does a cvec contain this chr?
+ */
+static int /* predicate */
+haschr(struct cvec *cv, /* character vector */
+ chr c) /* character to test for */
+{
+ int i;
+ chr *p;
+
+ for (p = cv->chrs, i = cv->nchrs; i > 0; p++, i--) {
+ if (*p == c) {
+ return 1;
+ }
+ }
+ for (p = cv->ranges, i = cv->nranges; i > 0; p += 2, i--) {
+ if ((*p <= c) && (c <= *(p+1))) {
+ return 1;
+ }
+ }
+ return 0;
+}
+
+/*
+ * getcvec - get a cvec, remembering it as v->cv
+ */
+static struct cvec *
+getcvec(struct vars *v, /* context */
+ int nchrs, /* to hold this many chrs... */
+ int nranges, /* ... and this many ranges... */
+ int nmcces) /* ... and this many MCCEs */
+{
+ if (v->cv != NULL && nchrs <= v->cv->chrspace &&
+ nranges <= v->cv->rangespace && nmcces <= v->cv->mccespace) {
+ return clearcvec(v->cv);
+ }
+
+ if (v->cv != NULL) {
+ freecvec(v->cv);
+ }
+ v->cv = newcvec(nchrs, nranges, nmcces);
+ if (v->cv == NULL) {
+ ERR(REG_ESPACE);
+ }
+
+ return v->cv;
+}
+
+/*
+ * freecvec - free a cvec
+ */
+static void
+freecvec(struct cvec *cv)
+{
+ FREE(cv);
+}
--- /dev/null
+/*
+ * lexical analyzer
+ * This file is #included by regcomp.c.
+ *
+ * Copyright (c) 1998, 1999 Henry Spencer. All rights reserved.
+ *
+ * Development of this software was funded, in part, by Cray Research Inc.,
+ * UUNET Communications Services Inc., Sun Microsystems Inc., and Scriptics
+ * Corporation, none of whom are responsible for the results. The author
+ * thanks all of them.
+ *
+ * Redistribution and use in source and binary forms -- with or without
+ * modification -- are permitted for any purpose, provided that
+ * redistributions in source form retain this entire copyright notice and
+ * indicate the origin and nature of any modifications.
+ *
+ * I'd appreciate being given credit for this package in the documentation
+ * of software which uses it, but that is not a requirement.
+ *
+ * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * HENRY SPENCER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
+ * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+ * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
+ * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
+ * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * $Header: /cvsroot/pgsql/src/backend/regex/regc_lex.c,v 1.1 2003/02/05 17:41:32 tgl Exp $
+ *
+ */
+
+/* scanning macros (know about v) */
+#define ATEOS() (v->now >= v->stop)
+#define HAVE(n) (v->stop - v->now >= (n))
+#define NEXT1(c) (!ATEOS() && *v->now == CHR(c))
+#define NEXT2(a,b) (HAVE(2) && *v->now == CHR(a) && *(v->now+1) == CHR(b))
+#define NEXT3(a,b,c) (HAVE(3) && *v->now == CHR(a) && \
+ *(v->now+1) == CHR(b) && \
+ *(v->now+2) == CHR(c))
+#define SET(c) (v->nexttype = (c))
+#define SETV(c, n) (v->nexttype = (c), v->nextvalue = (n))
+#define RET(c) return (SET(c), 1)
+#define RETV(c, n) return (SETV(c, n), 1)
+#define FAILW(e) return (ERR(e), 0) /* ERR does SET(EOS) */
+#define LASTTYPE(t) (v->lasttype == (t))
+
+/* lexical contexts */
+#define L_ERE 1 /* mainline ERE/ARE */
+#define L_BRE 2 /* mainline BRE */
+#define L_Q 3 /* REG_QUOTE */
+#define L_EBND 4 /* ERE/ARE bound */
+#define L_BBND 5 /* BRE bound */
+#define L_BRACK 6 /* brackets */
+#define L_CEL 7 /* collating element */
+#define L_ECL 8 /* equivalence class */
+#define L_CCL 9 /* character class */
+#define INTOCON(c) (v->lexcon = (c))
+#define INCON(con) (v->lexcon == (con))
+
+/* construct pointer past end of chr array */
+#define ENDOF(array) ((array) + sizeof(array)/sizeof(chr))
+
+/*
+ * lexstart - set up lexical stuff, scan leading options
+ */
+static void
+lexstart(struct vars *v)
+{
+ prefixes(v); /* may turn on new type bits etc. */
+ NOERR();
+
+ if (v->cflags®_QUOTE) {
+ assert(!(v->cflags&(REG_ADVANCED|REG_EXPANDED|REG_NEWLINE)));
+ INTOCON(L_Q);
+ } else if (v->cflags®_EXTENDED) {
+ assert(!(v->cflags®_QUOTE));
+ INTOCON(L_ERE);
+ } else {
+ assert(!(v->cflags&(REG_QUOTE|REG_ADVF)));
+ INTOCON(L_BRE);
+ }
+
+ v->nexttype = EMPTY; /* remember we were at the start */
+ next(v); /* set up the first token */
+}
+
+/*
+ * prefixes - implement various special prefixes
+ */
+static void
+prefixes(struct vars *v)
+{
+ /* literal string doesn't get any of this stuff */
+ if (v->cflags®_QUOTE)
+ return;
+
+ /* initial "***" gets special things */
+ if (HAVE(4) && NEXT3('*', '*', '*'))
+ switch (*(v->now + 3)) {
+ case CHR('?'): /* "***?" error, msg shows version */
+ ERR(REG_BADPAT);
+ return; /* proceed no further */
+ break;
+ case CHR('='): /* "***=" shifts to literal string */
+ NOTE(REG_UNONPOSIX);
+ v->cflags |= REG_QUOTE;
+ v->cflags &= ~(REG_ADVANCED|REG_EXPANDED|REG_NEWLINE);
+ v->now += 4;
+ return; /* and there can be no more prefixes */
+ break;
+ case CHR(':'): /* "***:" shifts to AREs */
+ NOTE(REG_UNONPOSIX);
+ v->cflags |= REG_ADVANCED;
+ v->now += 4;
+ break;
+ default: /* otherwise *** is just an error */
+ ERR(REG_BADRPT);
+ return;
+ break;
+ }
+
+ /* BREs and EREs don't get embedded options */
+ if ((v->cflags®_ADVANCED) != REG_ADVANCED)
+ return;
+
+ /* embedded options (AREs only) */
+ if (HAVE(3) && NEXT2('(', '?') && iscalpha(*(v->now + 2))) {
+ NOTE(REG_UNONPOSIX);
+ v->now += 2;
+ for (; !ATEOS() && iscalpha(*v->now); v->now++)
+ switch (*v->now) {
+ case CHR('b'): /* BREs (but why???) */
+ v->cflags &= ~(REG_ADVANCED|REG_QUOTE);
+ break;
+ case CHR('c'): /* case sensitive */
+ v->cflags &= ~REG_ICASE;
+ break;
+ case CHR('e'): /* plain EREs */
+ v->cflags |= REG_EXTENDED;
+ v->cflags &= ~(REG_ADVF|REG_QUOTE);
+ break;
+ case CHR('i'): /* case insensitive */
+ v->cflags |= REG_ICASE;
+ break;
+ case CHR('m'): /* Perloid synonym for n */
+ case CHR('n'): /* \n affects ^ $ . [^ */
+ v->cflags |= REG_NEWLINE;
+ break;
+ case CHR('p'): /* ~Perl, \n affects . [^ */
+ v->cflags |= REG_NLSTOP;
+ v->cflags &= ~REG_NLANCH;
+ break;
+ case CHR('q'): /* literal string */
+ v->cflags |= REG_QUOTE;
+ v->cflags &= ~REG_ADVANCED;
+ break;
+ case CHR('s'): /* single line, \n ordinary */
+ v->cflags &= ~REG_NEWLINE;
+ break;
+ case CHR('t'): /* tight syntax */
+ v->cflags &= ~REG_EXPANDED;
+ break;
+ case CHR('w'): /* weird, \n affects ^ $ only */
+ v->cflags &= ~REG_NLSTOP;
+ v->cflags |= REG_NLANCH;
+ break;
+ case CHR('x'): /* expanded syntax */
+ v->cflags |= REG_EXPANDED;
+ break;
+ default:
+ ERR(REG_BADOPT);
+ return;
+ }
+ if (!NEXT1(')')) {
+ ERR(REG_BADOPT);
+ return;
+ }
+ v->now++;
+ if (v->cflags®_QUOTE)
+ v->cflags &= ~(REG_EXPANDED|REG_NEWLINE);
+ }
+}
+
+/*
+ * lexnest - "call a subroutine", interpolating string at the lexical level
+ *
+ * Note, this is not a very general facility. There are a number of
+ * implicit assumptions about what sorts of strings can be subroutines.
+ */
+static void
+lexnest(struct vars *v,
+ chr *beginp, /* start of interpolation */
+ chr *endp) /* one past end of interpolation */
+{
+ assert(v->savenow == NULL); /* only one level of nesting */
+ v->savenow = v->now;
+ v->savestop = v->stop;
+ v->now = beginp;
+ v->stop = endp;
+}
+
+/*
+ * string constants to interpolate as expansions of things like \d
+ */
+static chr backd[] = { /* \d */
+ CHR('['), CHR('['), CHR(':'),
+ CHR('d'), CHR('i'), CHR('g'), CHR('i'), CHR('t'),
+ CHR(':'), CHR(']'), CHR(']')
+};
+static chr backD[] = { /* \D */
+ CHR('['), CHR('^'), CHR('['), CHR(':'),
+ CHR('d'), CHR('i'), CHR('g'), CHR('i'), CHR('t'),
+ CHR(':'), CHR(']'), CHR(']')
+};
+static chr brbackd[] = { /* \d within brackets */
+ CHR('['), CHR(':'),
+ CHR('d'), CHR('i'), CHR('g'), CHR('i'), CHR('t'),
+ CHR(':'), CHR(']')
+};
+static chr backs[] = { /* \s */
+ CHR('['), CHR('['), CHR(':'),
+ CHR('s'), CHR('p'), CHR('a'), CHR('c'), CHR('e'),
+ CHR(':'), CHR(']'), CHR(']')
+};
+static chr backS[] = { /* \S */
+ CHR('['), CHR('^'), CHR('['), CHR(':'),
+ CHR('s'), CHR('p'), CHR('a'), CHR('c'), CHR('e'),
+ CHR(':'), CHR(']'), CHR(']')
+};
+static chr brbacks[] = { /* \s within brackets */
+ CHR('['), CHR(':'),
+ CHR('s'), CHR('p'), CHR('a'), CHR('c'), CHR('e'),
+ CHR(':'), CHR(']')
+};
+static chr backw[] = { /* \w */
+ CHR('['), CHR('['), CHR(':'),
+ CHR('a'), CHR('l'), CHR('n'), CHR('u'), CHR('m'),
+ CHR(':'), CHR(']'), CHR('_'), CHR(']')
+};
+static chr backW[] = { /* \W */
+ CHR('['), CHR('^'), CHR('['), CHR(':'),
+ CHR('a'), CHR('l'), CHR('n'), CHR('u'), CHR('m'),
+ CHR(':'), CHR(']'), CHR('_'), CHR(']')
+};
+static chr brbackw[] = { /* \w within brackets */
+ CHR('['), CHR(':'),
+ CHR('a'), CHR('l'), CHR('n'), CHR('u'), CHR('m'),
+ CHR(':'), CHR(']'), CHR('_')
+};
+
+/*
+ * lexword - interpolate a bracket expression for word characters
+ * Possibly ought to inquire whether there is a "word" character class.
+ */
+static void
+lexword(struct vars *v)
+{
+ lexnest(v, backw, ENDOF(backw));
+}
+
+/*
+ * next - get next token
+ */
+static int /* 1 normal, 0 failure */
+next(struct vars *v)
+{
+ chr c;
+
+ /* errors yield an infinite sequence of failures */
+ if (ISERR())
+ return 0; /* the error has set nexttype to EOS */
+
+ /* remember flavor of last token */
+ v->lasttype = v->nexttype;
+
+ /* REG_BOSONLY */
+ if (v->nexttype == EMPTY && (v->cflags®_BOSONLY)) {
+ /* at start of a REG_BOSONLY RE */
+ RETV(SBEGIN, 0); /* same as \A */
+ }
+
+ /* if we're nested and we've hit end, return to outer level */
+ if (v->savenow != NULL && ATEOS()) {
+ v->now = v->savenow;
+ v->stop = v->savestop;
+ v->savenow = v->savestop = NULL;
+ }
+
+ /* skip white space etc. if appropriate (not in literal or []) */
+ if (v->cflags®_EXPANDED)
+ switch (v->lexcon) {
+ case L_ERE:
+ case L_BRE:
+ case L_EBND:
+ case L_BBND:
+ skip(v);
+ break;
+ }
+
+ /* handle EOS, depending on context */
+ if (ATEOS()) {
+ switch (v->lexcon) {
+ case L_ERE:
+ case L_BRE:
+ case L_Q:
+ RET(EOS);
+ break;
+ case L_EBND:
+ case L_BBND:
+ FAILW(REG_EBRACE);
+ break;
+ case L_BRACK:
+ case L_CEL:
+ case L_ECL:
+ case L_CCL:
+ FAILW(REG_EBRACK);
+ break;
+ }
+ assert(NOTREACHED);
+ }
+
+ /* okay, time to actually get a character */
+ c = *v->now++;
+
+ /* deal with the easy contexts, punt EREs to code below */
+ switch (v->lexcon) {
+ case L_BRE: /* punt BREs to separate function */
+ return brenext(v, c);
+ break;
+ case L_ERE: /* see below */
+ break;
+ case L_Q: /* literal strings are easy */
+ RETV(PLAIN, c);
+ break;
+ case L_BBND: /* bounds are fairly simple */
+ case L_EBND:
+ switch (c) {
+ case CHR('0'): case CHR('1'): case CHR('2'): case CHR('3'):
+ case CHR('4'): case CHR('5'): case CHR('6'): case CHR('7'):
+ case CHR('8'): case CHR('9'):
+ RETV(DIGIT, (chr)DIGITVAL(c));
+ break;
+ case CHR(','):
+ RET(',');
+ break;
+ case CHR('}'): /* ERE bound ends with } */
+ if (INCON(L_EBND)) {
+ INTOCON(L_ERE);
+ if ((v->cflags®_ADVF) && NEXT1('?')) {
+ v->now++;
+ NOTE(REG_UNONPOSIX);
+ RETV('}', 0);
+ }
+ RETV('}', 1);
+ } else
+ FAILW(REG_BADBR);
+ break;
+ case CHR('\\'): /* BRE bound ends with \} */
+ if (INCON(L_BBND) && NEXT1('}')) {
+ v->now++;
+ INTOCON(L_BRE);
+ RET('}');
+ } else
+ FAILW(REG_BADBR);
+ break;
+ default:
+ FAILW(REG_BADBR);
+ break;
+ }
+ assert(NOTREACHED);
+ break;
+ case L_BRACK: /* brackets are not too hard */
+ switch (c) {
+ case CHR(']'):
+ if (LASTTYPE('['))
+ RETV(PLAIN, c);
+ else {
+ INTOCON((v->cflags®_EXTENDED) ?
+ L_ERE : L_BRE);
+ RET(']');
+ }
+ break;
+ case CHR('\\'):
+ NOTE(REG_UBBS);
+ if (!(v->cflags®_ADVF))
+ RETV(PLAIN, c);
+ NOTE(REG_UNONPOSIX);
+ if (ATEOS())
+ FAILW(REG_EESCAPE);
+ (DISCARD)lexescape(v);
+ switch (v->nexttype) { /* not all escapes okay here */
+ case PLAIN:
+ return 1;
+ break;
+ case CCLASS:
+ switch (v->nextvalue) {
+ case 'd':
+ lexnest(v, brbackd, ENDOF(brbackd));
+ break;
+ case 's':
+ lexnest(v, brbacks, ENDOF(brbacks));
+ break;
+ case 'w':
+ lexnest(v, brbackw, ENDOF(brbackw));
+ break;
+ default:
+ FAILW(REG_EESCAPE);
+ break;
+ }
+ /* lexnest done, back up and try again */
+ v->nexttype = v->lasttype;
+ return next(v);
+ break;
+ }
+ /* not one of the acceptable escapes */
+ FAILW(REG_EESCAPE);
+ break;
+ case CHR('-'):
+ if (LASTTYPE('[') || NEXT1(']'))
+ RETV(PLAIN, c);
+ else
+ RETV(RANGE, c);
+ break;
+ case CHR('['):
+ if (ATEOS())
+ FAILW(REG_EBRACK);
+ switch (*v->now++) {
+ case CHR('.'):
+ INTOCON(L_CEL);
+ /* might or might not be locale-specific */
+ RET(COLLEL);
+ break;
+ case CHR('='):
+ INTOCON(L_ECL);
+ NOTE(REG_ULOCALE);
+ RET(ECLASS);
+ break;
+ case CHR(':'):
+ INTOCON(L_CCL);
+ NOTE(REG_ULOCALE);
+ RET(CCLASS);
+ break;
+ default: /* oops */
+ v->now--;
+ RETV(PLAIN, c);
+ break;
+ }
+ assert(NOTREACHED);
+ break;
+ default:
+ RETV(PLAIN, c);
+ break;
+ }
+ assert(NOTREACHED);
+ break;
+ case L_CEL: /* collating elements are easy */
+ if (c == CHR('.') && NEXT1(']')) {
+ v->now++;
+ INTOCON(L_BRACK);
+ RETV(END, '.');
+ } else
+ RETV(PLAIN, c);
+ break;
+ case L_ECL: /* ditto equivalence classes */
+ if (c == CHR('=') && NEXT1(']')) {
+ v->now++;
+ INTOCON(L_BRACK);
+ RETV(END, '=');
+ } else
+ RETV(PLAIN, c);
+ break;
+ case L_CCL: /* ditto character classes */
+ if (c == CHR(':') && NEXT1(']')) {
+ v->now++;
+ INTOCON(L_BRACK);
+ RETV(END, ':');
+ } else
+ RETV(PLAIN, c);
+ break;
+ default:
+ assert(NOTREACHED);
+ break;
+ }
+
+ /* that got rid of everything except EREs and AREs */
+ assert(INCON(L_ERE));
+
+ /* deal with EREs and AREs, except for backslashes */
+ switch (c) {
+ case CHR('|'):
+ RET('|');
+ break;
+ case CHR('*'):
+ if ((v->cflags®_ADVF) && NEXT1('?')) {
+ v->now++;
+ NOTE(REG_UNONPOSIX);
+ RETV('*', 0);
+ }
+ RETV('*', 1);
+ break;
+ case CHR('+'):
+ if ((v->cflags®_ADVF) && NEXT1('?')) {
+ v->now++;
+ NOTE(REG_UNONPOSIX);
+ RETV('+', 0);
+ }
+ RETV('+', 1);
+ break;
+ case CHR('?'):
+ if ((v->cflags®_ADVF) && NEXT1('?')) {
+ v->now++;
+ NOTE(REG_UNONPOSIX);
+ RETV('?', 0);
+ }
+ RETV('?', 1);
+ break;
+ case CHR('{'): /* bounds start or plain character */
+ if (v->cflags®_EXPANDED)
+ skip(v);
+ if (ATEOS() || !iscdigit(*v->now)) {
+ NOTE(REG_UBRACES);
+ NOTE(REG_UUNSPEC);
+ RETV(PLAIN, c);
+ } else {
+ NOTE(REG_UBOUNDS);
+ INTOCON(L_EBND);
+ RET('{');
+ }
+ assert(NOTREACHED);
+ break;
+ case CHR('('): /* parenthesis, or advanced extension */
+ if ((v->cflags®_ADVF) && NEXT1('?')) {
+ NOTE(REG_UNONPOSIX);
+ v->now++;
+ switch (*v->now++) {
+ case CHR(':'): /* non-capturing paren */
+ RETV('(', 0);
+ break;
+ case CHR('#'): /* comment */
+ while (!ATEOS() && *v->now != CHR(')'))
+ v->now++;
+ if (!ATEOS())
+ v->now++;
+ assert(v->nexttype == v->lasttype);
+ return next(v);
+ break;
+ case CHR('='): /* positive lookahead */
+ NOTE(REG_ULOOKAHEAD);
+ RETV(LACON, 1);
+ break;
+ case CHR('!'): /* negative lookahead */
+ NOTE(REG_ULOOKAHEAD);
+ RETV(LACON, 0);
+ break;
+ default:
+ FAILW(REG_BADRPT);
+ break;
+ }
+ assert(NOTREACHED);
+ }
+ if (v->cflags®_NOSUB)
+ RETV('(', 0); /* all parens non-capturing */
+ else
+ RETV('(', 1);
+ break;
+ case CHR(')'):
+ if (LASTTYPE('(')) {
+ NOTE(REG_UUNSPEC);
+ }
+ RETV(')', c);
+ break;
+ case CHR('['): /* easy except for [[:<:]] and [[:>:]] */
+ if (HAVE(6) && *(v->now+0) == CHR('[') &&
+ *(v->now+1) == CHR(':') &&
+ (*(v->now+2) == CHR('<') ||
+ *(v->now+2) == CHR('>')) &&
+ *(v->now+3) == CHR(':') &&
+ *(v->now+4) == CHR(']') &&
+ *(v->now+5) == CHR(']')) {
+ c = *(v->now+2);
+ v->now += 6;
+ NOTE(REG_UNONPOSIX);
+ RET((c == CHR('<')) ? '<' : '>');
+ }
+ INTOCON(L_BRACK);
+ if (NEXT1('^')) {
+ v->now++;
+ RETV('[', 0);
+ }
+ RETV('[', 1);
+ break;
+ case CHR('.'):
+ RET('.');
+ break;
+ case CHR('^'):
+ RET('^');
+ break;
+ case CHR('$'):
+ RET('$');
+ break;
+ case CHR('\\'): /* mostly punt backslashes to code below */
+ if (ATEOS())
+ FAILW(REG_EESCAPE);
+ break;
+ default: /* ordinary character */
+ RETV(PLAIN, c);
+ break;
+ }
+
+ /* ERE/ARE backslash handling; backslash already eaten */
+ assert(!ATEOS());
+ if (!(v->cflags®_ADVF)) { /* only AREs have non-trivial escapes */
+ if (iscalnum(*v->now)) {
+ NOTE(REG_UBSALNUM);
+ NOTE(REG_UUNSPEC);
+ }
+ RETV(PLAIN, *v->now++);
+ }
+ (DISCARD)lexescape(v);
+ if (ISERR())
+ FAILW(REG_EESCAPE);
+ if (v->nexttype == CCLASS) { /* fudge at lexical level */
+ switch (v->nextvalue) {
+ case 'd': lexnest(v, backd, ENDOF(backd)); break;
+ case 'D': lexnest(v, backD, ENDOF(backD)); break;
+ case 's': lexnest(v, backs, ENDOF(backs)); break;
+ case 'S': lexnest(v, backS, ENDOF(backS)); break;
+ case 'w': lexnest(v, backw, ENDOF(backw)); break;
+ case 'W': lexnest(v, backW, ENDOF(backW)); break;
+ default:
+ assert(NOTREACHED);
+ FAILW(REG_ASSERT);
+ break;
+ }
+ /* lexnest done, back up and try again */
+ v->nexttype = v->lasttype;
+ return next(v);
+ }
+ /* otherwise, lexescape has already done the work */
+ return !ISERR();
+}
+
+/*
+ * lexescape - parse an ARE backslash escape (backslash already eaten)
+ * Note slightly nonstandard use of the CCLASS type code.
+ */
+static int /* not actually used, but convenient for RETV */
+lexescape(struct vars *v)
+{
+ chr c;
+ static chr alert[] = {
+ CHR('a'), CHR('l'), CHR('e'), CHR('r'), CHR('t')
+ };
+ static chr esc[] = {
+ CHR('E'), CHR('S'), CHR('C')
+ };
+ chr *save;
+
+ assert(v->cflags®_ADVF);
+
+ assert(!ATEOS());
+ c = *v->now++;
+ if (!iscalnum(c))
+ RETV(PLAIN, c);
+
+ NOTE(REG_UNONPOSIX);
+ switch (c) {
+ case CHR('a'):
+ RETV(PLAIN, chrnamed(v, alert, ENDOF(alert), CHR('\007')));
+ break;
+ case CHR('A'):
+ RETV(SBEGIN, 0);
+ break;
+ case CHR('b'):
+ RETV(PLAIN, CHR('\b'));
+ break;
+ case CHR('B'):
+ RETV(PLAIN, CHR('\\'));
+ break;
+ case CHR('c'):
+ NOTE(REG_UUNPORT);
+ if (ATEOS())
+ FAILW(REG_EESCAPE);
+ RETV(PLAIN, (chr)(*v->now++ & 037));
+ break;
+ case CHR('d'):
+ NOTE(REG_ULOCALE);
+ RETV(CCLASS, 'd');
+ break;
+ case CHR('D'):
+ NOTE(REG_ULOCALE);
+ RETV(CCLASS, 'D');
+ break;
+ case CHR('e'):
+ NOTE(REG_UUNPORT);
+ RETV(PLAIN, chrnamed(v, esc, ENDOF(esc), CHR('\033')));
+ break;
+ case CHR('f'):
+ RETV(PLAIN, CHR('\f'));
+ break;
+ case CHR('m'):
+ RET('<');
+ break;
+ case CHR('M'):
+ RET('>');
+ break;
+ case CHR('n'):
+ RETV(PLAIN, CHR('\n'));
+ break;
+ case CHR('r'):
+ RETV(PLAIN, CHR('\r'));
+ break;
+ case CHR('s'):
+ NOTE(REG_ULOCALE);
+ RETV(CCLASS, 's');
+ break;
+ case CHR('S'):
+ NOTE(REG_ULOCALE);
+ RETV(CCLASS, 'S');
+ break;
+ case CHR('t'):
+ RETV(PLAIN, CHR('\t'));
+ break;
+ case CHR('u'):
+ c = lexdigits(v, 16, 4, 4);
+ if (ISERR())
+ FAILW(REG_EESCAPE);
+ RETV(PLAIN, c);
+ break;
+ case CHR('U'):
+ c = lexdigits(v, 16, 8, 8);
+ if (ISERR())
+ FAILW(REG_EESCAPE);
+ RETV(PLAIN, c);
+ break;
+ case CHR('v'):
+ RETV(PLAIN, CHR('\v'));
+ break;
+ case CHR('w'):
+ NOTE(REG_ULOCALE);
+ RETV(CCLASS, 'w');
+ break;
+ case CHR('W'):
+ NOTE(REG_ULOCALE);
+ RETV(CCLASS, 'W');
+ break;
+ case CHR('x'):
+ NOTE(REG_UUNPORT);
+ c = lexdigits(v, 16, 1, 255); /* REs >255 long outside spec */
+ if (ISERR())
+ FAILW(REG_EESCAPE);
+ RETV(PLAIN, c);
+ break;
+ case CHR('y'):
+ NOTE(REG_ULOCALE);
+ RETV(WBDRY, 0);
+ break;
+ case CHR('Y'):
+ NOTE(REG_ULOCALE);
+ RETV(NWBDRY, 0);
+ break;
+ case CHR('Z'):
+ RETV(SEND, 0);
+ break;
+ case CHR('1'): case CHR('2'): case CHR('3'): case CHR('4'):
+ case CHR('5'): case CHR('6'): case CHR('7'): case CHR('8'):
+ case CHR('9'):
+ save = v->now;
+ v->now--; /* put first digit back */
+ c = lexdigits(v, 10, 1, 255); /* REs >255 long outside spec */
+ if (ISERR())
+ FAILW(REG_EESCAPE);
+ /* ugly heuristic (first test is "exactly 1 digit?") */
+ if (v->now - save == 0 || (int)c <= v->nsubexp) {
+ NOTE(REG_UBACKREF);
+ RETV(BACKREF, (chr)c);
+ }
+ /* oops, doesn't look like it's a backref after all... */
+ v->now = save;
+ /* and fall through into octal number */
+ case CHR('0'):
+ NOTE(REG_UUNPORT);
+ v->now--; /* put first digit back */
+ c = lexdigits(v, 8, 1, 3);
+ if (ISERR())
+ FAILW(REG_EESCAPE);
+ RETV(PLAIN, c);
+ break;
+ default:
+ assert(iscalpha(c));
+ FAILW(REG_EESCAPE); /* unknown alphabetic escape */
+ break;
+ }
+ assert(NOTREACHED);
+}
+
+/*
+ * lexdigits - slurp up digits and return chr value
+ */
+static chr /* chr value; errors signalled via ERR */
+lexdigits(struct vars *v,
+ int base,
+ int minlen,
+ int maxlen)
+{
+ uchr n; /* unsigned to avoid overflow misbehavior */
+ int len;
+ chr c;
+ int d;
+ const uchr ub = (uchr) base;
+
+ n = 0;
+ for (len = 0; len < maxlen && !ATEOS(); len++) {
+ c = *v->now++;
+ switch (c) {
+ case CHR('0'): case CHR('1'): case CHR('2'): case CHR('3'):
+ case CHR('4'): case CHR('5'): case CHR('6'): case CHR('7'):
+ case CHR('8'): case CHR('9'):
+ d = DIGITVAL(c);
+ break;
+ case CHR('a'): case CHR('A'): d = 10; break;
+ case CHR('b'): case CHR('B'): d = 11; break;
+ case CHR('c'): case CHR('C'): d = 12; break;
+ case CHR('d'): case CHR('D'): d = 13; break;
+ case CHR('e'): case CHR('E'): d = 14; break;
+ case CHR('f'): case CHR('F'): d = 15; break;
+ default:
+ v->now--; /* oops, not a digit at all */
+ d = -1;
+ break;
+ }
+
+ if (d >= base) { /* not a plausible digit */
+ v->now--;
+ d = -1;
+ }
+ if (d < 0)
+ break; /* NOTE BREAK OUT */
+ n = n*ub + (uchr)d;
+ }
+ if (len < minlen)
+ ERR(REG_EESCAPE);
+
+ return (chr)n;
+}
+
+/*
+ * brenext - get next BRE token
+ *
+ * This is much like EREs except for all the stupid backslashes and the
+ * context-dependency of some things.
+ */
+static int /* 1 normal, 0 failure */
+brenext(struct vars *v,
+ chr pc)
+{
+ chr c = (chr)pc;
+
+ switch (c) {
+ case CHR('*'):
+ if (LASTTYPE(EMPTY) || LASTTYPE('(') || LASTTYPE('^'))
+ RETV(PLAIN, c);
+ RET('*');
+ break;
+ case CHR('['):
+ if (HAVE(6) && *(v->now+0) == CHR('[') &&
+ *(v->now+1) == CHR(':') &&
+ (*(v->now+2) == CHR('<') ||
+ *(v->now+2) == CHR('>')) &&
+ *(v->now+3) == CHR(':') &&
+ *(v->now+4) == CHR(']') &&
+ *(v->now+5) == CHR(']')) {
+ c = *(v->now+2);
+ v->now += 6;
+ NOTE(REG_UNONPOSIX);
+ RET((c == CHR('<')) ? '<' : '>');
+ }
+ INTOCON(L_BRACK);
+ if (NEXT1('^')) {
+ v->now++;
+ RETV('[', 0);
+ }
+ RETV('[', 1);
+ break;
+ case CHR('.'):
+ RET('.');
+ break;
+ case CHR('^'):
+ if (LASTTYPE(EMPTY))
+ RET('^');
+ if (LASTTYPE('(')) {
+ NOTE(REG_UUNSPEC);
+ RET('^');
+ }
+ RETV(PLAIN, c);
+ break;
+ case CHR('$'):
+ if (v->cflags®_EXPANDED)
+ skip(v);
+ if (ATEOS())
+ RET('$');
+ if (NEXT2('\\', ')')) {
+ NOTE(REG_UUNSPEC);
+ RET('$');
+ }
+ RETV(PLAIN, c);
+ break;
+ case CHR('\\'):
+ break; /* see below */
+ default:
+ RETV(PLAIN, c);
+ break;
+ }
+
+ assert(c == CHR('\\'));
+
+ if (ATEOS())
+ FAILW(REG_EESCAPE);
+
+ c = *v->now++;
+ switch (c) {
+ case CHR('{'):
+ INTOCON(L_BBND);
+ NOTE(REG_UBOUNDS);
+ RET('{');
+ break;
+ case CHR('('):
+ RETV('(', 1);
+ break;
+ case CHR(')'):
+ RETV(')', c);
+ break;
+ case CHR('<'):
+ NOTE(REG_UNONPOSIX);
+ RET('<');
+ break;
+ case CHR('>'):
+ NOTE(REG_UNONPOSIX);
+ RET('>');
+ break;
+ case CHR('1'): case CHR('2'): case CHR('3'): case CHR('4'):
+ case CHR('5'): case CHR('6'): case CHR('7'): case CHR('8'):
+ case CHR('9'):
+ NOTE(REG_UBACKREF);
+ RETV(BACKREF, (chr)DIGITVAL(c));
+ break;
+ default:
+ if (iscalnum(c)) {
+ NOTE(REG_UBSALNUM);
+ NOTE(REG_UUNSPEC);
+ }
+ RETV(PLAIN, c);
+ break;
+ }
+
+ assert(NOTREACHED);
+}
+
+/*
+ * skip - skip white space and comments in expanded form
+ */
+static void
+skip(struct vars *v)
+{
+ chr *start = v->now;
+
+ assert(v->cflags®_EXPANDED);
+
+ for (;;) {
+ while (!ATEOS() && iscspace(*v->now))
+ v->now++;
+ if (ATEOS() || *v->now != CHR('#'))
+ break; /* NOTE BREAK OUT */
+ assert(NEXT1('#'));
+ while (!ATEOS() && *v->now != CHR('\n'))
+ v->now++;
+ /* leave the newline to be picked up by the iscspace loop */
+ }
+
+ if (v->now != start)
+ NOTE(REG_UNONPOSIX);
+}
+
+/*
+ * newline - return the chr for a newline
+ *
+ * This helps confine use of CHR to this source file.
+ */
+static chr
+newline(void)
+{
+ return CHR('\n');
+}
+
+/*
+ * chrnamed - return the chr known by a given (chr string) name
+ *
+ * The code is a bit clumsy, but this routine gets only such specialized
+ * use that it hardly matters.
+ */
+static chr
+chrnamed(struct vars *v,
+ chr *startp, /* start of name */
+ chr *endp, /* just past end of name */
+ chr lastresort) /* what to return if name lookup fails */
+{
+ celt c;
+ int errsave;
+ int e;
+ struct cvec *cv;
+
+ errsave = v->err;
+ v->err = 0;
+ c = element(v, startp, endp);
+ e = v->err;
+ v->err = errsave;
+
+ if (e != 0)
+ return (chr)lastresort;
+
+ cv = range(v, c, c, 0);
+ if (cv->nchrs == 0)
+ return (chr)lastresort;
+ return cv->chrs[0];
+}
--- /dev/null
+/*
+ * regc_locale.c --
+ *
+ * This file contains locale-specific regexp routines.
+ * This file is #included by regcomp.c.
+ *
+ * Copyright (c) 1998 by Scriptics Corporation.
+ *
+ * This software is copyrighted by the Regents of the University of
+ * California, Sun Microsystems, Inc., Scriptics Corporation, ActiveState
+ * Corporation and other parties. The following terms apply to all files
+ * associated with the software unless explicitly disclaimed in
+ * individual files.
+ *
+ * The authors hereby grant permission to use, copy, modify, distribute,
+ * and license this software and its documentation for any purpose, provided
+ * that existing copyright notices are retained in all copies and that this
+ * notice is included verbatim in any distributions. No written agreement,
+ * license, or royalty fee is required for any of the authorized uses.
+ * Modifications to this software may be copyrighted by their authors
+ * and need not follow the licensing terms described here, provided that
+ * the new terms are clearly indicated on the first page of each file where
+ * they apply.
+ *
+ * IN NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY
+ * FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES
+ * ARISING OUT OF THE USE OF THIS SOFTWARE, ITS DOCUMENTATION, OR ANY
+ * DERIVATIVES THEREOF, EVEN IF THE AUTHORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHORS AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. THIS SOFTWARE
+ * IS PROVIDED ON AN "AS IS" BASIS, AND THE AUTHORS AND DISTRIBUTORS HAVE
+ * NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
+ * MODIFICATIONS.
+ *
+ * GOVERNMENT USE: If you are acquiring this software on behalf of the
+ * U.S. government, the Government shall have only "Restricted Rights"
+ * in the software and related documentation as defined in the Federal
+ * Acquisition Regulations (FARs) in Clause 52.227.19 (c) (2). If you
+ * are acquiring the software on behalf of the Department of Defense, the
+ * software shall be classified as "Commercial Computer Software" and the
+ * Government shall have only "Restricted Rights" as defined in Clause
+ * 252.227-7013 (c) (1) of DFARs. Notwithstanding the foregoing, the
+ * authors grant the U.S. Government and others acting in its behalf
+ * permission to use and distribute the software in accordance with the
+ * terms specified in this license.
+ *
+ * $Header: /cvsroot/pgsql/src/backend/regex/regc_locale.c,v 1.1 2003/02/05 17:41:32 tgl Exp $
+ */
+
+/* ASCII character-name table */
+
+static struct cname {
+ char *name;
+ char code;
+} cnames[] = {
+ {"NUL", '\0'},
+ {"SOH", '\001'},
+ {"STX", '\002'},
+ {"ETX", '\003'},
+ {"EOT", '\004'},
+ {"ENQ", '\005'},
+ {"ACK", '\006'},
+ {"BEL", '\007'},
+ {"alert", '\007'},
+ {"BS", '\010'},
+ {"backspace", '\b'},
+ {"HT", '\011'},
+ {"tab", '\t'},
+ {"LF", '\012'},
+ {"newline", '\n'},
+ {"VT", '\013'},
+ {"vertical-tab", '\v'},
+ {"FF", '\014'},
+ {"form-feed", '\f'},
+ {"CR", '\015'},
+ {"carriage-return", '\r'},
+ {"SO", '\016'},
+ {"SI", '\017'},
+ {"DLE", '\020'},
+ {"DC1", '\021'},
+ {"DC2", '\022'},
+ {"DC3", '\023'},
+ {"DC4", '\024'},
+ {"NAK", '\025'},
+ {"SYN", '\026'},
+ {"ETB", '\027'},
+ {"CAN", '\030'},
+ {"EM", '\031'},
+ {"SUB", '\032'},
+ {"ESC", '\033'},
+ {"IS4", '\034'},
+ {"FS", '\034'},
+ {"IS3", '\035'},
+ {"GS", '\035'},
+ {"IS2", '\036'},
+ {"RS", '\036'},
+ {"IS1", '\037'},
+ {"US", '\037'},
+ {"space", ' '},
+ {"exclamation-mark",'!'},
+ {"quotation-mark", '"'},
+ {"number-sign", '#'},
+ {"dollar-sign", '$'},
+ {"percent-sign", '%'},
+ {"ampersand", '&'},
+ {"apostrophe", '\''},
+ {"left-parenthesis",'('},
+ {"right-parenthesis", ')'},
+ {"asterisk", '*'},
+ {"plus-sign", '+'},
+ {"comma", ','},
+ {"hyphen", '-'},
+ {"hyphen-minus", '-'},
+ {"period", '.'},
+ {"full-stop", '.'},
+ {"slash", '/'},
+ {"solidus", '/'},
+ {"zero", '0'},
+ {"one", '1'},
+ {"two", '2'},
+ {"three", '3'},
+ {"four", '4'},
+ {"five", '5'},
+ {"six", '6'},
+ {"seven", '7'},
+ {"eight", '8'},
+ {"nine", '9'},
+ {"colon", ':'},
+ {"semicolon", ';'},
+ {"less-than-sign", '<'},
+ {"equals-sign", '='},
+ {"greater-than-sign", '>'},
+ {"question-mark", '?'},
+ {"commercial-at", '@'},
+ {"left-square-bracket", '['},
+ {"backslash", '\\'},
+ {"reverse-solidus", '\\'},
+ {"right-square-bracket", ']'},
+ {"circumflex", '^'},
+ {"circumflex-accent", '^'},
+ {"underscore", '_'},
+ {"low-line", '_'},
+ {"grave-accent", '`'},
+ {"left-brace", '{'},
+ {"left-curly-bracket", '{'},
+ {"vertical-line", '|'},
+ {"right-brace", '}'},
+ {"right-curly-bracket", '}'},
+ {"tilde", '~'},
+ {"DEL", '\177'},
+ {NULL, 0}
+};
+
+/*
+ * some ctype functions with non-ascii-char guard
+ */
+static int
+pg_isdigit(pg_wchar c)
+{
+ return (c >= 0 && c <= UCHAR_MAX && isdigit((unsigned char) c));
+}
+
+static int
+pg_isalpha(pg_wchar c)
+{
+ return (c >= 0 && c <= UCHAR_MAX && isalpha((unsigned char) c));
+}
+
+static int
+pg_isalnum(pg_wchar c)
+{
+ return (c >= 0 && c <= UCHAR_MAX && isalnum((unsigned char) c));
+}
+
+static int
+pg_isupper(pg_wchar c)
+{
+ return (c >= 0 && c <= UCHAR_MAX && isupper((unsigned char) c));
+}
+
+static int
+pg_islower(pg_wchar c)
+{
+ return (c >= 0 && c <= UCHAR_MAX && islower((unsigned char) c));
+}
+
+static int
+pg_isgraph(pg_wchar c)
+{
+ return (c >= 0 && c <= UCHAR_MAX && isgraph((unsigned char) c));
+}
+
+static int
+pg_ispunct(pg_wchar c)
+{
+ return (c >= 0 && c <= UCHAR_MAX && ispunct((unsigned char) c));
+}
+
+static int
+pg_isspace(pg_wchar c)
+{
+ return (c >= 0 && c <= UCHAR_MAX && isspace((unsigned char) c));
+}
+
+static pg_wchar
+pg_toupper(pg_wchar c)
+{
+ if (c >= 0 && c <= UCHAR_MAX)
+ return toupper((unsigned char) c);
+ return c;
+}
+
+static pg_wchar
+pg_tolower(pg_wchar c)
+{
+ if (c >= 0 && c <= UCHAR_MAX)
+ return tolower((unsigned char) c);
+ return c;
+}
+
+
+/*
+ * nmcces - how many distinct MCCEs are there?
+ */
+static int
+nmcces(struct vars *v)
+{
+ /*
+ * No multi-character collating elements defined at the moment.
+ */
+ return 0;
+}
+
+/*
+ * nleaders - how many chrs can be first chrs of MCCEs?
+ */
+static int
+nleaders(struct vars *v)
+{
+ return 0;
+}
+
+/*
+ * allmcces - return a cvec with all the MCCEs of the locale
+ */
+static struct cvec *
+allmcces(struct vars *v, /* context */
+ struct cvec *cv) /* this is supposed to have enough room */
+{
+ return clearcvec(cv);
+}
+
+/*
+ * element - map collating-element name to celt
+ */
+static celt
+element(struct vars *v, /* context */
+ chr *startp, /* points to start of name */
+ chr *endp) /* points just past end of name */
+{
+ struct cname *cn;
+ size_t len;
+
+ /* generic: one-chr names stand for themselves */
+ assert(startp < endp);
+ len = endp - startp;
+ if (len == 1) {
+ return *startp;
+ }
+
+ NOTE(REG_ULOCALE);
+
+ /* search table */
+ for (cn=cnames; cn->name!=NULL; cn++) {
+ if (strlen(cn->name)==len &&
+ pg_char_and_wchar_strncmp(cn->name, startp, len)==0) {
+ break; /* NOTE BREAK OUT */
+ }
+ }
+ if (cn->name != NULL) {
+ return CHR(cn->code);
+ }
+
+ /* couldn't find it */
+ ERR(REG_ECOLLATE);
+ return 0;
+}
+
+/*
+ * range - supply cvec for a range, including legality check
+ */
+static struct cvec *
+range(struct vars *v, /* context */
+ celt a, /* range start */
+ celt b, /* range end, might equal a */
+ int cases) /* case-independent? */
+{
+ int nchrs;
+ struct cvec *cv;
+ celt c, lc, uc;
+
+ if (a != b && !before(a, b)) {
+ ERR(REG_ERANGE);
+ return NULL;
+ }
+
+ if (!cases) { /* easy version */
+ cv = getcvec(v, 0, 1, 0);
+ NOERRN();
+ addrange(cv, a, b);
+ return cv;
+ }
+
+ /*
+ * When case-independent, it's hard to decide when cvec ranges are
+ * usable, so for now at least, we won't try. We allocate enough
+ * space for two case variants plus a little extra for the two
+ * title case variants.
+ */
+
+ nchrs = (b - a + 1)*2 + 4;
+
+ cv = getcvec(v, nchrs, 0, 0);
+ NOERRN();
+
+ for (c=a; c<=b; c++) {
+ addchr(cv, c);
+ lc = pg_tolower((chr)c);
+ if (c != lc) {
+ addchr(cv, lc);
+ }
+ uc = pg_toupper((chr)c);
+ if (c != uc) {
+ addchr(cv, uc);
+ }
+ }
+
+ return cv;
+}
+
+/*
+ * before - is celt x before celt y, for purposes of range legality?
+ */
+static int /* predicate */
+before(celt x, celt y)
+{
+ /* trivial because no MCCEs */
+ if (x < y) {
+ return 1;
+ }
+ return 0;
+}
+
+/*
+ * eclass - supply cvec for an equivalence class
+ * Must include case counterparts on request.
+ */
+static struct cvec *
+eclass(struct vars *v, /* context */
+ celt c, /* Collating element representing
+ * the equivalence class. */
+ int cases) /* all cases? */
+{
+ struct cvec *cv;
+
+ /* crude fake equivalence class for testing */
+ if ((v->cflags®_FAKE) && c == 'x') {
+ cv = getcvec(v, 4, 0, 0);
+ addchr(cv, (chr)'x');
+ addchr(cv, (chr)'y');
+ if (cases) {
+ addchr(cv, (chr)'X');
+ addchr(cv, (chr)'Y');
+ }
+ return cv;
+ }
+
+ /* otherwise, none */
+ if (cases) {
+ return allcases(v, c);
+ }
+ cv = getcvec(v, 1, 0, 0);
+ assert(cv != NULL);
+ addchr(cv, (chr)c);
+ return cv;
+}
+
+/*
+ * cclass - supply cvec for a character class
+ *
+ * Must include case counterparts on request.
+ */
+static struct cvec *
+cclass(struct vars *v, /* context */
+ chr *startp, /* where the name starts */
+ chr *endp, /* just past the end of the name */
+ int cases) /* case-independent? */
+{
+ size_t len;
+ struct cvec *cv = NULL;
+ char **namePtr;
+ int i, index;
+
+ /*
+ * The following arrays define the valid character class names.
+ */
+
+ static char *classNames[] = {
+ "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
+ "lower", "print", "punct", "space", "upper", "xdigit", NULL
+ };
+
+ enum classes {
+ CC_ALNUM, CC_ALPHA, CC_ASCII, CC_BLANK, CC_CNTRL, CC_DIGIT, CC_GRAPH,
+ CC_LOWER, CC_PRINT, CC_PUNCT, CC_SPACE, CC_UPPER, CC_XDIGIT
+ };
+
+ /*
+ * Map the name to the corresponding enumerated value.
+ */
+ len = endp - startp;
+ index = -1;
+ for (namePtr=classNames,i=0 ; *namePtr!=NULL ; namePtr++,i++) {
+ if (strlen(*namePtr) == len &&
+ pg_char_and_wchar_strncmp(*namePtr, startp, len) == 0) {
+ index = i;
+ break;
+ }
+ }
+ if (index == -1) {
+ ERR(REG_ECTYPE);
+ return NULL;
+ }
+
+ /*
+ * Remap lower and upper to alpha if the match is case insensitive.
+ */
+
+ if (cases &&
+ ((enum classes) index == CC_LOWER ||
+ (enum classes) index == CC_UPPER))
+ index = (int) CC_ALPHA;
+
+ /*
+ * Now compute the character class contents.
+ *
+ * For the moment, assume that only char codes < 256 can be in these
+ * classes.
+ */
+
+ switch((enum classes) index) {
+ case CC_PRINT:
+ case CC_ALNUM:
+ cv = getcvec(v, UCHAR_MAX, 1, 0);
+ if (cv) {
+ for (i=0 ; i<= UCHAR_MAX ; i++) {
+ if (pg_isalpha((chr) i))
+ addchr(cv, (chr) i);
+ }
+ addrange(cv, (chr) '0', (chr) '9');
+ }
+ break;
+ case CC_ALPHA:
+ cv = getcvec(v, UCHAR_MAX, 0, 0);
+ if (cv) {
+ for (i=0 ; i<= UCHAR_MAX ; i++) {
+ if (pg_isalpha((chr) i))
+ addchr(cv, (chr) i);
+ }
+ }
+ break;
+ case CC_ASCII:
+ cv = getcvec(v, 0, 1, 0);
+ if (cv) {
+ addrange(cv, 0, 0x7f);
+ }
+ break;
+ case CC_BLANK:
+ cv = getcvec(v, 2, 0, 0);
+ addchr(cv, '\t');
+ addchr(cv, ' ');
+ break;
+ case CC_CNTRL:
+ cv = getcvec(v, 0, 2, 0);
+ addrange(cv, 0x0, 0x1f);
+ addrange(cv, 0x7f, 0x9f);
+ break;
+ case CC_DIGIT:
+ cv = getcvec(v, 0, 1, 0);
+ if (cv) {
+ addrange(cv, (chr) '0', (chr) '9');
+ }
+ break;
+ case CC_PUNCT:
+ cv = getcvec(v, UCHAR_MAX, 0, 0);
+ if (cv) {
+ for (i=0 ; i<= UCHAR_MAX ; i++) {
+ if (pg_ispunct((chr) i))
+ addchr(cv, (chr) i);
+ }
+ }
+ break;
+ case CC_XDIGIT:
+ cv = getcvec(v, 0, 3, 0);
+ if (cv) {
+ addrange(cv, '0', '9');
+ addrange(cv, 'a', 'f');
+ addrange(cv, 'A', 'F');
+ }
+ break;
+ case CC_SPACE:
+ cv = getcvec(v, UCHAR_MAX, 0, 0);
+ if (cv) {
+ for (i=0 ; i<= UCHAR_MAX ; i++) {
+ if (pg_isspace((chr) i))
+ addchr(cv, (chr) i);
+ }
+ }
+ break;
+ case CC_LOWER:
+ cv = getcvec(v, UCHAR_MAX, 0, 0);
+ if (cv) {
+ for (i=0 ; i<= UCHAR_MAX ; i++) {
+ if (pg_islower((chr) i))
+ addchr(cv, (chr) i);
+ }
+ }
+ break;
+ case CC_UPPER:
+ cv = getcvec(v, UCHAR_MAX, 0, 0);
+ if (cv) {
+ for (i=0 ; i<= UCHAR_MAX ; i++) {
+ if (pg_isupper((chr) i))
+ addchr(cv, (chr) i);
+ }
+ }
+ break;
+ case CC_GRAPH:
+ cv = getcvec(v, UCHAR_MAX, 0, 0);
+ if (cv) {
+ for (i=0 ; i<= UCHAR_MAX ; i++) {
+ if (pg_isgraph((chr) i))
+ addchr(cv, (chr) i);
+ }
+ }
+ break;
+ }
+ if (cv == NULL) {
+ ERR(REG_ESPACE);
+ }
+ return cv;
+}
+
+/*
+ * allcases - supply cvec for all case counterparts of a chr (including itself)
+ *
+ * This is a shortcut, preferably an efficient one, for simple characters;
+ * messy cases are done via range().
+ */
+static struct cvec *
+allcases(struct vars *v, /* context */
+ chr pc) /* character to get case equivs of */
+{
+ struct cvec *cv;
+ chr c = (chr)pc;
+ chr lc, uc;
+
+ lc = pg_tolower((chr)c);
+ uc = pg_toupper((chr)c);
+
+ cv = getcvec(v, 2, 0, 0);
+ addchr(cv, lc);
+ if (lc != uc) {
+ addchr(cv, uc);
+ }
+ return cv;
+}
+
+/*
+ * cmp - chr-substring compare
+ *
+ * Backrefs need this. It should preferably be efficient.
+ * Note that it does not need to report anything except equal/unequal.
+ * Note also that the length is exact, and the comparison should not
+ * stop at embedded NULs!
+ */
+static int /* 0 for equal, nonzero for unequal */
+cmp(const chr *x, const chr *y, /* strings to compare */
+ size_t len) /* exact length of comparison */
+{
+ return memcmp(VS(x), VS(y), len*sizeof(chr));
+}
+
+/*
+ * casecmp - case-independent chr-substring compare
+ *
+ * REG_ICASE backrefs need this. It should preferably be efficient.
+ * Note that it does not need to report anything except equal/unequal.
+ * Note also that the length is exact, and the comparison should not
+ * stop at embedded NULs!
+ */
+static int /* 0 for equal, nonzero for unequal */
+casecmp(const chr *x, const chr *y, /* strings to compare */
+ size_t len) /* exact length of comparison */
+{
+ for (; len > 0; len--, x++, y++) {
+ if ((*x!=*y) && (pg_tolower(*x) != pg_tolower(*y))) {
+ return 1;
+ }
+ }
+ return 0;
+}
--- /dev/null
+/*
+ * NFA utilities.
+ * This file is #included by regcomp.c.
+ *
+ * Copyright (c) 1998, 1999 Henry Spencer. All rights reserved.
+ *
+ * Development of this software was funded, in part, by Cray Research Inc.,
+ * UUNET Communications Services Inc., Sun Microsystems Inc., and Scriptics
+ * Corporation, none of whom are responsible for the results. The author
+ * thanks all of them.
+ *
+ * Redistribution and use in source and binary forms -- with or without
+ * modification -- are permitted for any purpose, provided that
+ * redistributions in source form retain this entire copyright notice and
+ * indicate the origin and nature of any modifications.
+ *
+ * I'd appreciate being given credit for this package in the documentation
+ * of software which uses it, but that is not a requirement.
+ *
+ * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * HENRY SPENCER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
+ * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+ * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
+ * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
+ * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * $Header: /cvsroot/pgsql/src/backend/regex/regc_nfa.c,v 1.1 2003/02/05 17:41:32 tgl Exp $
+ *
+ *
+ * One or two things that technically ought to be in here
+ * are actually in color.c, thanks to some incestuous relationships in
+ * the color chains.
+ */
+
+#define NISERR() VISERR(nfa->v)
+#define NERR(e) VERR(nfa->v, (e))
+
+
+/*
+ * newnfa - set up an NFA
+ */
+static struct nfa * /* the NFA, or NULL */
+newnfa(struct vars *v,
+ struct colormap *cm,
+ struct nfa *parent) /* NULL if primary NFA */
+{
+ struct nfa *nfa;
+
+ nfa = (struct nfa *)MALLOC(sizeof(struct nfa));
+ if (nfa == NULL)
+ return NULL;
+
+ nfa->states = NULL;
+ nfa->slast = NULL;
+ nfa->free = NULL;
+ nfa->nstates = 0;
+ nfa->cm = cm;
+ nfa->v = v;
+ nfa->bos[0] = nfa->bos[1] = COLORLESS;
+ nfa->eos[0] = nfa->eos[1] = COLORLESS;
+ nfa->post = newfstate(nfa, '@'); /* number 0 */
+ nfa->pre = newfstate(nfa, '>'); /* number 1 */
+ nfa->parent = parent;
+
+ nfa->init = newstate(nfa); /* may become invalid later */
+ nfa->final = newstate(nfa);
+ if (ISERR()) {
+ freenfa(nfa);
+ return NULL;
+ }
+ rainbow(nfa, nfa->cm, PLAIN, COLORLESS, nfa->pre, nfa->init);
+ newarc(nfa, '^', 1, nfa->pre, nfa->init);
+ newarc(nfa, '^', 0, nfa->pre, nfa->init);
+ rainbow(nfa, nfa->cm, PLAIN, COLORLESS, nfa->final, nfa->post);
+ newarc(nfa, '$', 1, nfa->final, nfa->post);
+ newarc(nfa, '$', 0, nfa->final, nfa->post);
+
+ if (ISERR()) {
+ freenfa(nfa);
+ return NULL;
+ }
+ return nfa;
+}
+
+/*
+ * freenfa - free an entire NFA
+ */
+static void
+freenfa(struct nfa *nfa)
+{
+ struct state *s;
+
+ while ((s = nfa->states) != NULL) {
+ s->nins = s->nouts = 0; /* don't worry about arcs */
+ freestate(nfa, s);
+ }
+ while ((s = nfa->free) != NULL) {
+ nfa->free = s->next;
+ destroystate(nfa, s);
+ }
+
+ nfa->slast = NULL;
+ nfa->nstates = -1;
+ nfa->pre = NULL;
+ nfa->post = NULL;
+ FREE(nfa);
+}
+
+/*
+ * newstate - allocate an NFA state, with zero flag value
+ */
+static struct state * /* NULL on error */
+newstate(struct nfa *nfa)
+{
+ struct state *s;
+
+ if (nfa->free != NULL) {
+ s = nfa->free;
+ nfa->free = s->next;
+ } else {
+ s = (struct state *)MALLOC(sizeof(struct state));
+ if (s == NULL) {
+ NERR(REG_ESPACE);
+ return NULL;
+ }
+ s->oas.next = NULL;
+ s->free = NULL;
+ s->noas = 0;
+ }
+
+ assert(nfa->nstates >= 0);
+ s->no = nfa->nstates++;
+ s->flag = 0;
+ if (nfa->states == NULL)
+ nfa->states = s;
+ s->nins = 0;
+ s->ins = NULL;
+ s->nouts = 0;
+ s->outs = NULL;
+ s->tmp = NULL;
+ s->next = NULL;
+ if (nfa->slast != NULL) {
+ assert(nfa->slast->next == NULL);
+ nfa->slast->next = s;
+ }
+ s->prev = nfa->slast;
+ nfa->slast = s;
+ return s;
+}
+
+/*
+ * newfstate - allocate an NFA state with a specified flag value
+ */
+static struct state * /* NULL on error */
+newfstate(struct nfa *nfa, int flag)
+{
+ struct state *s;
+
+ s = newstate(nfa);
+ if (s != NULL)
+ s->flag = (char)flag;
+ return s;
+}
+
+/*
+ * dropstate - delete a state's inarcs and outarcs and free it
+ */
+static void
+dropstate(struct nfa *nfa,
+ struct state *s)
+{
+ struct arc *a;
+
+ while ((a = s->ins) != NULL)
+ freearc(nfa, a);
+ while ((a = s->outs) != NULL)
+ freearc(nfa, a);
+ freestate(nfa, s);
+}
+
+/*
+ * freestate - free a state, which has no in-arcs or out-arcs
+ */
+static void
+freestate(struct nfa *nfa,
+ struct state *s)
+{
+ assert(s != NULL);
+ assert(s->nins == 0 && s->nouts == 0);
+
+ s->no = FREESTATE;
+ s->flag = 0;
+ if (s->next != NULL)
+ s->next->prev = s->prev;
+ else {
+ assert(s == nfa->slast);
+ nfa->slast = s->prev;
+ }
+ if (s->prev != NULL)
+ s->prev->next = s->next;
+ else {
+ assert(s == nfa->states);
+ nfa->states = s->next;
+ }
+ s->prev = NULL;
+ s->next = nfa->free; /* don't delete it, put it on the free list */
+ nfa->free = s;
+}
+
+/*
+ * destroystate - really get rid of an already-freed state
+ */
+static void
+destroystate(struct nfa *nfa,
+ struct state *s)
+{
+ struct arcbatch *ab;
+ struct arcbatch *abnext;
+
+ assert(s->no == FREESTATE);
+ for (ab = s->oas.next; ab != NULL; ab = abnext) {
+ abnext = ab->next;
+ FREE(ab);
+ }
+ s->ins = NULL;
+ s->outs = NULL;
+ s->next = NULL;
+ FREE(s);
+}
+
+/*
+ * newarc - set up a new arc within an NFA
+ */
+static void
+newarc(struct nfa *nfa,
+ int t,
+ pcolor co,
+ struct state *from,
+ struct state *to)
+{
+ struct arc *a;
+
+ assert(from != NULL && to != NULL);
+
+ /* check for duplicates */
+ for (a = from->outs; a != NULL; a = a->outchain)
+ if (a->to == to && a->co == co && a->type == t)
+ return;
+
+ a = allocarc(nfa, from);
+ if (NISERR())
+ return;
+ assert(a != NULL);
+
+ a->type = t;
+ a->co = (color)co;
+ a->to = to;
+ a->from = from;
+
+ /*
+ * Put the new arc on the beginning, not the end, of the chains.
+ * Not only is this easier, it has the very useful side effect that
+ * deleting the most-recently-added arc is the cheapest case rather
+ * than the most expensive one.
+ */
+ a->inchain = to->ins;
+ to->ins = a;
+ a->outchain = from->outs;
+ from->outs = a;
+
+ from->nouts++;
+ to->nins++;
+
+ if (COLORED(a) && nfa->parent == NULL)
+ colorchain(nfa->cm, a);
+
+ return;
+}
+
+/*
+ * allocarc - allocate a new out-arc within a state
+ */
+static struct arc * /* NULL for failure */
+allocarc(struct nfa *nfa,
+ struct state *s)
+{
+ struct arc *a;
+ struct arcbatch *new;
+ int i;
+
+ /* shortcut */
+ if (s->free == NULL && s->noas < ABSIZE) {
+ a = &s->oas.a[s->noas];
+ s->noas++;
+ return a;
+ }
+
+ /* if none at hand, get more */
+ if (s->free == NULL) {
+ new = (struct arcbatch *)MALLOC(sizeof(struct arcbatch));
+ if (new == NULL) {
+ NERR(REG_ESPACE);
+ return NULL;
+ }
+ new->next = s->oas.next;
+ s->oas.next = new;
+
+ for (i = 0; i < ABSIZE; i++) {
+ new->a[i].type = 0;
+ new->a[i].freechain = &new->a[i+1];
+ }
+ new->a[ABSIZE-1].freechain = NULL;
+ s->free = &new->a[0];
+ }
+ assert(s->free != NULL);
+
+ a = s->free;
+ s->free = a->freechain;
+ return a;
+}
+
+/*
+ * freearc - free an arc
+ */
+static void
+freearc(struct nfa *nfa,
+ struct arc *victim)
+{
+ struct state *from = victim->from;
+ struct state *to = victim->to;
+ struct arc *a;
+
+ assert(victim->type != 0);
+
+ /* take it off color chain if necessary */
+ if (COLORED(victim) && nfa->parent == NULL)
+ uncolorchain(nfa->cm, victim);
+
+ /* take it off source's out-chain */
+ assert(from != NULL);
+ assert(from->outs != NULL);
+ a = from->outs;
+ if (a == victim) /* simple case: first in chain */
+ from->outs = victim->outchain;
+ else {
+ for (; a != NULL && a->outchain != victim; a = a->outchain)
+ continue;
+ assert(a != NULL);
+ a->outchain = victim->outchain;
+ }
+ from->nouts--;
+
+ /* take it off target's in-chain */
+ assert(to != NULL);
+ assert(to->ins != NULL);
+ a = to->ins;
+ if (a == victim) /* simple case: first in chain */
+ to->ins = victim->inchain;
+ else {
+ for (; a != NULL && a->inchain != victim; a = a->inchain)
+ continue;
+ assert(a != NULL);
+ a->inchain = victim->inchain;
+ }
+ to->nins--;
+
+ /* clean up and place on free list */
+ victim->type = 0;
+ victim->from = NULL; /* precautions... */
+ victim->to = NULL;
+ victim->inchain = NULL;
+ victim->outchain = NULL;
+ victim->freechain = from->free;
+ from->free = victim;
+}
+
+/*
+ * findarc - find arc, if any, from given source with given type and color
+ * If there is more than one such arc, the result is random.
+ */
+static struct arc *
+findarc(struct state *s,
+ int type,
+ pcolor co)
+{
+ struct arc *a;
+
+ for (a = s->outs; a != NULL; a = a->outchain)
+ if (a->type == type && a->co == co)
+ return a;
+ return NULL;
+}
+
+/*
+ * cparc - allocate a new arc within an NFA, copying details from old one
+ */
+static void
+cparc(struct nfa *nfa,
+ struct arc *oa,
+ struct state *from,
+ struct state *to)
+{
+ newarc(nfa, oa->type, oa->co, from, to);
+}
+
+/*
+ * moveins - move all in arcs of a state to another state
+ *
+ * You might think this could be done better by just updating the
+ * existing arcs, and you would be right if it weren't for the desire
+ * for duplicate suppression, which makes it easier to just make new
+ * ones to exploit the suppression built into newarc.
+ */
+static void
+moveins(struct nfa *nfa,
+ struct state *old,
+ struct state *new)
+{
+ struct arc *a;
+
+ assert(old != new);
+
+ while ((a = old->ins) != NULL) {
+ cparc(nfa, a, a->from, new);
+ freearc(nfa, a);
+ }
+ assert(old->nins == 0);
+ assert(old->ins == NULL);
+}
+
+/*
+ * copyins - copy all in arcs of a state to another state
+ */
+static void
+copyins(struct nfa *nfa,
+ struct state *old,
+ struct state *new)
+{
+ struct arc *a;
+
+ assert(old != new);
+
+ for (a = old->ins; a != NULL; a = a->inchain)
+ cparc(nfa, a, a->from, new);
+}
+
+/*
+ * moveouts - move all out arcs of a state to another state
+ */
+static void
+moveouts(struct nfa *nfa,
+ struct state *old,
+ struct state *new)
+{
+ struct arc *a;
+
+ assert(old != new);
+
+ while ((a = old->outs) != NULL) {
+ cparc(nfa, a, new, a->to);
+ freearc(nfa, a);
+ }
+}
+
+/*
+ * copyouts - copy all out arcs of a state to another state
+ */
+static void
+copyouts(struct nfa *nfa,
+ struct state *old,
+ struct state *new)
+{
+ struct arc *a;
+
+ assert(old != new);
+
+ for (a = old->outs; a != NULL; a = a->outchain)
+ cparc(nfa, a, new, a->to);
+}
+
+/*
+ * cloneouts - copy out arcs of a state to another state pair, modifying type
+ */
+static void
+cloneouts(struct nfa *nfa,
+ struct state *old,
+ struct state *from,
+ struct state *to,
+ int type)
+{
+ struct arc *a;
+
+ assert(old != from);
+
+ for (a = old->outs; a != NULL; a = a->outchain)
+ newarc(nfa, type, a->co, from, to);
+}
+
+/*
+ * delsub - delete a sub-NFA, updating subre pointers if necessary
+ *
+ * This uses a recursive traversal of the sub-NFA, marking already-seen
+ * states using their tmp pointer.
+ */
+static void
+delsub(struct nfa *nfa,
+ struct state *lp, /* the sub-NFA goes from here... */
+ struct state *rp) /* ...to here, *not* inclusive */
+{
+ assert(lp != rp);
+
+ rp->tmp = rp; /* mark end */
+
+ deltraverse(nfa, lp, lp);
+ assert(lp->nouts == 0 && rp->nins == 0); /* did the job */
+ assert(lp->no != FREESTATE && rp->no != FREESTATE); /* no more */
+
+ rp->tmp = NULL; /* unmark end */
+ lp->tmp = NULL; /* and begin, marked by deltraverse */
+}
+
+/*
+ * deltraverse - the recursive heart of delsub
+ * This routine's basic job is to destroy all out-arcs of the state.
+ */
+static void
+deltraverse(struct nfa *nfa,
+ struct state *leftend,
+ struct state *s)
+{
+ struct arc *a;
+ struct state *to;
+
+ if (s->nouts == 0)
+ return; /* nothing to do */
+ if (s->tmp != NULL)
+ return; /* already in progress */
+
+ s->tmp = s; /* mark as in progress */
+
+ while ((a = s->outs) != NULL) {
+ to = a->to;
+ deltraverse(nfa, leftend, to);
+ assert(to->nouts == 0 || to->tmp != NULL);
+ freearc(nfa, a);
+ if (to->nins == 0 && to->tmp == NULL) {
+ assert(to->nouts == 0);
+ freestate(nfa, to);
+ }
+ }
+
+ assert(s->no != FREESTATE); /* we're still here */
+ assert(s == leftend || s->nins != 0); /* and still reachable */
+ assert(s->nouts == 0); /* but have no outarcs */
+
+ s->tmp = NULL; /* we're done here */
+}
+
+/*
+ * dupnfa - duplicate sub-NFA
+ *
+ * Another recursive traversal, this time using tmp to point to duplicates
+ * as well as mark already-seen states. (You knew there was a reason why
+ * it's a state pointer, didn't you? :-))
+ */
+static void
+dupnfa(struct nfa *nfa,
+ struct state *start, /* duplicate of subNFA starting here */
+ struct state *stop, /* and stopping here */
+ struct state *from, /* stringing duplicate from here */
+ struct state *to) /* to here */
+{
+ if (start == stop) {
+ newarc(nfa, EMPTY, 0, from, to);
+ return;
+ }
+
+ stop->tmp = to;
+ duptraverse(nfa, start, from);
+ /* done, except for clearing out the tmp pointers */
+
+ stop->tmp = NULL;
+ cleartraverse(nfa, start);
+}
+
+/*
+ * duptraverse - recursive heart of dupnfa
+ */
+static void
+duptraverse(struct nfa *nfa,
+ struct state *s,
+ struct state *stmp) /* s's duplicate, or NULL */
+{
+ struct arc *a;
+
+ if (s->tmp != NULL)
+ return; /* already done */
+
+ s->tmp = (stmp == NULL) ? newstate(nfa) : stmp;
+ if (s->tmp == NULL) {
+ assert(NISERR());
+ return;
+ }
+
+ for (a = s->outs; a != NULL && !NISERR(); a = a->outchain) {
+ duptraverse(nfa, a->to, (struct state *)NULL);
+ assert(a->to->tmp != NULL);
+ cparc(nfa, a, s->tmp, a->to->tmp);
+ }
+}
+
+/*
+ * cleartraverse - recursive cleanup for algorithms that leave tmp ptrs set
+ */
+static void
+cleartraverse(struct nfa *nfa,
+ struct state *s)
+{
+ struct arc *a;
+
+ if (s->tmp == NULL)
+ return;
+ s->tmp = NULL;
+
+ for (a = s->outs; a != NULL; a = a->outchain)
+ cleartraverse(nfa, a->to);
+}
+
+/*
+ * specialcolors - fill in special colors for an NFA
+ */
+static void
+specialcolors(struct nfa *nfa)
+{
+ /* false colors for BOS, BOL, EOS, EOL */
+ if (nfa->parent == NULL) {
+ nfa->bos[0] = pseudocolor(nfa->cm);
+ nfa->bos[1] = pseudocolor(nfa->cm);
+ nfa->eos[0] = pseudocolor(nfa->cm);
+ nfa->eos[1] = pseudocolor(nfa->cm);
+ } else {
+ assert(nfa->parent->bos[0] != COLORLESS);
+ nfa->bos[0] = nfa->parent->bos[0];
+ assert(nfa->parent->bos[1] != COLORLESS);
+ nfa->bos[1] = nfa->parent->bos[1];
+ assert(nfa->parent->eos[0] != COLORLESS);
+ nfa->eos[0] = nfa->parent->eos[0];
+ assert(nfa->parent->eos[1] != COLORLESS);
+ nfa->eos[1] = nfa->parent->eos[1];
+ }
+}
+
+/*
+ * optimize - optimize an NFA
+ */
+static long /* re_info bits */
+optimize(struct nfa *nfa,
+ FILE *f) /* for debug output; NULL none */
+{
+#ifdef REG_DEBUG
+ int verbose = (f != NULL) ? 1 : 0;
+
+ if (verbose)
+ fprintf(f, "\ninitial cleanup:\n");
+#endif
+ cleanup(nfa); /* may simplify situation */
+#ifdef REG_DEBUG
+ if (verbose)
+ dumpnfa(nfa, f);
+ if (verbose)
+ fprintf(f, "\nempties:\n");
+#endif
+ fixempties(nfa, f); /* get rid of EMPTY arcs */
+#ifdef REG_DEBUG
+ if (verbose)
+ fprintf(f, "\nconstraints:\n");
+#endif
+ pullback(nfa, f); /* pull back constraints backward */
+ pushfwd(nfa, f); /* push fwd constraints forward */
+#ifdef REG_DEBUG
+ if (verbose)
+ fprintf(f, "\nfinal cleanup:\n");
+#endif
+ cleanup(nfa); /* final tidying */
+ return analyze(nfa); /* and analysis */
+}
+
+/*
+ * pullback - pull back constraints backward to (with luck) eliminate them
+ */
+static void
+pullback(struct nfa *nfa,
+ FILE *f) /* for debug output; NULL none */
+{
+ struct state *s;
+ struct state *nexts;
+ struct arc *a;
+ struct arc *nexta;
+ int progress;
+
+ /* find and pull until there are no more */
+ do {
+ progress = 0;
+ for (s = nfa->states; s != NULL && !NISERR(); s = nexts) {
+ nexts = s->next;
+ for (a = s->outs; a != NULL && !NISERR(); a = nexta) {
+ nexta = a->outchain;
+ if (a->type == '^' || a->type == BEHIND)
+ if (pull(nfa, a))
+ progress = 1;
+ assert(nexta == NULL || s->no != FREESTATE);
+ }
+ }
+ if (progress && f != NULL)
+ dumpnfa(nfa, f);
+ } while (progress && !NISERR());
+ if (NISERR())
+ return;
+
+ for (a = nfa->pre->outs; a != NULL; a = nexta) {
+ nexta = a->outchain;
+ if (a->type == '^') {
+ assert(a->co == 0 || a->co == 1);
+ newarc(nfa, PLAIN, nfa->bos[a->co], a->from, a->to);
+ freearc(nfa, a);
+ }
+ }
+}
+
+/*
+ * pull - pull a back constraint backward past its source state
+ * A significant property of this function is that it deletes at most
+ * one state -- the constraint's from state -- and only if the constraint
+ * was that state's last outarc.
+ */
+static int /* 0 couldn't, 1 could */
+pull(struct nfa *nfa,
+ struct arc *con)
+{
+ struct state *from = con->from;
+ struct state *to = con->to;
+ struct arc *a;
+ struct arc *nexta;
+ struct state *s;
+
+ if (from == to) { /* circular constraint is pointless */
+ freearc(nfa, con);
+ return 1;
+ }
+ if (from->flag) /* can't pull back beyond start */
+ return 0;
+ if (from->nins == 0) { /* unreachable */
+ freearc(nfa, con);
+ return 1;
+ }
+
+ /* first, clone from state if necessary to avoid other outarcs */
+ if (from->nouts > 1) {
+ s = newstate(nfa);
+ if (NISERR())
+ return 0;
+ assert(to != from); /* con is not an inarc */
+ copyins(nfa, from, s); /* duplicate inarcs */
+ cparc(nfa, con, s, to); /* move constraint arc */
+ freearc(nfa, con);
+ from = s;
+ con = from->outs;
+ }
+ assert(from->nouts == 1);
+
+ /* propagate the constraint into the from state's inarcs */
+ for (a = from->ins; a != NULL; a = nexta) {
+ nexta = a->inchain;
+ switch (combine(con, a)) {
+ case INCOMPATIBLE: /* destroy the arc */
+ freearc(nfa, a);
+ break;
+ case SATISFIED: /* no action needed */
+ break;
+ case COMPATIBLE: /* swap the two arcs, more or less */
+ s = newstate(nfa);
+ if (NISERR())
+ return 0;
+ cparc(nfa, a, s, to); /* anticipate move */
+ cparc(nfa, con, a->from, s);
+ if (NISERR())
+ return 0;
+ freearc(nfa, a);
+ break;
+ default:
+ assert(NOTREACHED);
+ break;
+ }
+ }
+
+ /* remaining inarcs, if any, incorporate the constraint */
+ moveins(nfa, from, to);
+ dropstate(nfa, from); /* will free the constraint */
+ return 1;
+}
+
+/*
+ * pushfwd - push forward constraints forward to (with luck) eliminate them
+ */
+static void
+pushfwd(struct nfa *nfa,
+ FILE *f) /* for debug output; NULL none */
+{
+ struct state *s;
+ struct state *nexts;
+ struct arc *a;
+ struct arc *nexta;
+ int progress;
+
+ /* find and push until there are no more */
+ do {
+ progress = 0;
+ for (s = nfa->states; s != NULL && !NISERR(); s = nexts) {
+ nexts = s->next;
+ for (a = s->ins; a != NULL && !NISERR(); a = nexta) {
+ nexta = a->inchain;
+ if (a->type == '$' || a->type == AHEAD)
+ if (push(nfa, a))
+ progress = 1;
+ assert(nexta == NULL || s->no != FREESTATE);
+ }
+ }
+ if (progress && f != NULL)
+ dumpnfa(nfa, f);
+ } while (progress && !NISERR());
+ if (NISERR())
+ return;
+
+ for (a = nfa->post->ins; a != NULL; a = nexta) {
+ nexta = a->inchain;
+ if (a->type == '$') {
+ assert(a->co == 0 || a->co == 1);
+ newarc(nfa, PLAIN, nfa->eos[a->co], a->from, a->to);
+ freearc(nfa, a);
+ }
+ }
+}
+
+/*
+ * push - push a forward constraint forward past its destination state
+ * A significant property of this function is that it deletes at most
+ * one state -- the constraint's to state -- and only if the constraint
+ * was that state's last inarc.
+ */
+static int /* 0 couldn't, 1 could */
+push(struct nfa *nfa,
+ struct arc *con)
+{
+ struct state *from = con->from;
+ struct state *to = con->to;
+ struct arc *a;
+ struct arc *nexta;
+ struct state *s;
+
+ if (to == from) { /* circular constraint is pointless */
+ freearc(nfa, con);
+ return 1;
+ }
+ if (to->flag) /* can't push forward beyond end */
+ return 0;
+ if (to->nouts == 0) { /* dead end */
+ freearc(nfa, con);
+ return 1;
+ }
+
+ /* first, clone to state if necessary to avoid other inarcs */
+ if (to->nins > 1) {
+ s = newstate(nfa);
+ if (NISERR())
+ return 0;
+ copyouts(nfa, to, s); /* duplicate outarcs */
+ cparc(nfa, con, from, s); /* move constraint */
+ freearc(nfa, con);
+ to = s;
+ con = to->ins;
+ }
+ assert(to->nins == 1);
+
+ /* propagate the constraint into the to state's outarcs */
+ for (a = to->outs; a != NULL; a = nexta) {
+ nexta = a->outchain;
+ switch (combine(con, a)) {
+ case INCOMPATIBLE: /* destroy the arc */
+ freearc(nfa, a);
+ break;
+ case SATISFIED: /* no action needed */
+ break;
+ case COMPATIBLE: /* swap the two arcs, more or less */
+ s = newstate(nfa);
+ if (NISERR())
+ return 0;
+ cparc(nfa, con, s, a->to); /* anticipate move */
+ cparc(nfa, a, from, s);
+ if (NISERR())
+ return 0;
+ freearc(nfa, a);
+ break;
+ default:
+ assert(NOTREACHED);
+ break;
+ }
+ }
+
+ /* remaining outarcs, if any, incorporate the constraint */
+ moveouts(nfa, to, from);
+ dropstate(nfa, to); /* will free the constraint */
+ return 1;
+}
+
+/*
+ * combine - constraint lands on an arc, what happens?
+ *
+ * #def INCOMPATIBLE 1 // destroys arc
+ * #def SATISFIED 2 // constraint satisfied
+ * #def COMPATIBLE 3 // compatible but not satisfied yet
+ */
+static int
+combine(struct arc *con,
+ struct arc *a)
+{
+# define CA(ct,at) (((ct)<<CHAR_BIT) | (at))
+
+ switch (CA(con->type, a->type)) {
+ case CA('^', PLAIN): /* newlines are handled separately */
+ case CA('$', PLAIN):
+ return INCOMPATIBLE;
+ break;
+ case CA(AHEAD, PLAIN): /* color constraints meet colors */
+ case CA(BEHIND, PLAIN):
+ if (con->co == a->co)
+ return SATISFIED;
+ return INCOMPATIBLE;
+ break;
+ case CA('^', '^'): /* collision, similar constraints */
+ case CA('$', '$'):
+ case CA(AHEAD, AHEAD):
+ case CA(BEHIND, BEHIND):
+ if (con->co == a->co) /* true duplication */
+ return SATISFIED;
+ return INCOMPATIBLE;
+ break;
+ case CA('^', BEHIND): /* collision, dissimilar constraints */
+ case CA(BEHIND, '^'):
+ case CA('$', AHEAD):
+ case CA(AHEAD, '$'):
+ return INCOMPATIBLE;
+ break;
+ case CA('^', '$'): /* constraints passing each other */
+ case CA('^', AHEAD):
+ case CA(BEHIND, '$'):
+ case CA(BEHIND, AHEAD):
+ case CA('$', '^'):
+ case CA('$', BEHIND):
+ case CA(AHEAD, '^'):
+ case CA(AHEAD, BEHIND):
+ case CA('^', LACON):
+ case CA(BEHIND, LACON):
+ case CA('$', LACON):
+ case CA(AHEAD, LACON):
+ return COMPATIBLE;
+ break;
+ }
+ assert(NOTREACHED);
+ return INCOMPATIBLE; /* for benefit of blind compilers */
+}
+
+/*
+ * fixempties - get rid of EMPTY arcs
+ */
+static void
+fixempties(struct nfa *nfa,
+ FILE *f) /* for debug output; NULL none */
+{
+ struct state *s;
+ struct state *nexts;
+ struct arc *a;
+ struct arc *nexta;
+ int progress;
+
+ /* find and eliminate empties until there are no more */
+ do {
+ progress = 0;
+ for (s = nfa->states; s != NULL && !NISERR(); s = nexts) {
+ nexts = s->next;
+ for (a = s->outs; a != NULL && !NISERR(); a = nexta) {
+ nexta = a->outchain;
+ if (a->type == EMPTY && unempty(nfa, a))
+ progress = 1;
+ assert(nexta == NULL || s->no != FREESTATE);
+ }
+ }
+ if (progress && f != NULL)
+ dumpnfa(nfa, f);
+ } while (progress && !NISERR());
+}
+
+/*
+ * unempty - optimize out an EMPTY arc, if possible
+ *
+ * Actually, as it stands this function always succeeds, but the return
+ * value is kept with an eye on possible future changes.
+ */
+static int /* 0 couldn't, 1 could */
+unempty(struct nfa *nfa,
+ struct arc *a)
+{
+ struct state *from = a->from;
+ struct state *to = a->to;
+ int usefrom; /* work on from, as opposed to to? */
+
+ assert(a->type == EMPTY);
+ assert(from != nfa->pre && to != nfa->post);
+
+ if (from == to) { /* vacuous loop */
+ freearc(nfa, a);
+ return 1;
+ }
+
+ /* decide which end to work on */
+ usefrom = 1; /* default: attack from */
+ if (from->nouts > to->nins)
+ usefrom = 0;
+ else if (from->nouts == to->nins) {
+ /* decide on secondary issue: move/copy fewest arcs */
+ if (from->nins > to->nouts)
+ usefrom = 0;
+ }
+
+ freearc(nfa, a);
+ if (usefrom) {
+ if (from->nouts == 0) {
+ /* was the state's only outarc */
+ moveins(nfa, from, to);
+ freestate(nfa, from);
+ } else
+ copyins(nfa, from, to);
+ } else {
+ if (to->nins == 0) {
+ /* was the state's only inarc */
+ moveouts(nfa, to, from);
+ freestate(nfa, to);
+ } else
+ copyouts(nfa, to, from);
+ }
+
+ return 1;
+}
+
+/*
+ * cleanup - clean up NFA after optimizations
+ */
+static void
+cleanup(struct nfa *nfa)
+{
+ struct state *s;
+ struct state *nexts;
+ int n;
+
+ /* clear out unreachable or dead-end states */
+ /* use pre to mark reachable, then post to mark can-reach-post */
+ markreachable(nfa, nfa->pre, (struct state *)NULL, nfa->pre);
+ markcanreach(nfa, nfa->post, nfa->pre, nfa->post);
+ for (s = nfa->states; s != NULL; s = nexts) {
+ nexts = s->next;
+ if (s->tmp != nfa->post && !s->flag)
+ dropstate(nfa, s);
+ }
+ assert(nfa->post->nins == 0 || nfa->post->tmp == nfa->post);
+ cleartraverse(nfa, nfa->pre);
+ assert(nfa->post->nins == 0 || nfa->post->tmp == NULL);
+ /* the nins==0 (final unreachable) case will be caught later */
+
+ /* renumber surviving states */
+ n = 0;
+ for (s = nfa->states; s != NULL; s = s->next)
+ s->no = n++;
+ nfa->nstates = n;
+}
+
+/*
+ * markreachable - recursive marking of reachable states
+ */
+static void
+markreachable(struct nfa *nfa,
+ struct state *s,
+ struct state *okay, /* consider only states with this mark */
+ struct state *mark) /* the value to mark with */
+{
+ struct arc *a;
+
+ if (s->tmp != okay)
+ return;
+ s->tmp = mark;
+
+ for (a = s->outs; a != NULL; a = a->outchain)
+ markreachable(nfa, a->to, okay, mark);
+}
+
+/*
+ * markcanreach - recursive marking of states which can reach here
+ */
+static void
+markcanreach(struct nfa *nfa,
+ struct state *s,
+ struct state *okay, /* consider only states with this mark */
+ struct state *mark) /* the value to mark with */
+{
+ struct arc *a;
+
+ if (s->tmp != okay)
+ return;
+ s->tmp = mark;
+
+ for (a = s->ins; a != NULL; a = a->inchain)
+ markcanreach(nfa, a->from, okay, mark);
+}
+
+/*
+ * analyze - ascertain potentially-useful facts about an optimized NFA
+ */
+static long /* re_info bits to be ORed in */
+analyze(struct nfa *nfa)
+{
+ struct arc *a;
+ struct arc *aa;
+
+ if (nfa->pre->outs == NULL)
+ return REG_UIMPOSSIBLE;
+ for (a = nfa->pre->outs; a != NULL; a = a->outchain)
+ for (aa = a->to->outs; aa != NULL; aa = aa->outchain)
+ if (aa->to == nfa->post)
+ return REG_UEMPTYMATCH;
+ return 0;
+}
+
+/*
+ * compact - compact an NFA
+ */
+static void
+compact(struct nfa *nfa,
+ struct cnfa *cnfa)
+{
+ struct state *s;
+ struct arc *a;
+ size_t nstates;
+ size_t narcs;
+ struct carc *ca;
+ struct carc *first;
+
+ assert (!NISERR());
+
+ nstates = 0;
+ narcs = 0;
+ for (s = nfa->states; s != NULL; s = s->next) {
+ nstates++;
+ narcs += 1 + s->nouts + 1;
+ /* 1 as a fake for flags, nouts for arcs, 1 as endmarker */
+ }
+
+ cnfa->states = (struct carc **)MALLOC(nstates * sizeof(struct carc *));
+ cnfa->arcs = (struct carc *)MALLOC(narcs * sizeof(struct carc));
+ if (cnfa->states == NULL || cnfa->arcs == NULL) {
+ if (cnfa->states != NULL)
+ FREE(cnfa->states);
+ if (cnfa->arcs != NULL)
+ FREE(cnfa->arcs);
+ NERR(REG_ESPACE);
+ return;
+ }
+ cnfa->nstates = nstates;
+ cnfa->pre = nfa->pre->no;
+ cnfa->post = nfa->post->no;
+ cnfa->bos[0] = nfa->bos[0];
+ cnfa->bos[1] = nfa->bos[1];
+ cnfa->eos[0] = nfa->eos[0];
+ cnfa->eos[1] = nfa->eos[1];
+ cnfa->ncolors = maxcolor(nfa->cm) + 1;
+ cnfa->flags = 0;
+
+ ca = cnfa->arcs;
+ for (s = nfa->states; s != NULL; s = s->next) {
+ assert((size_t)s->no < nstates);
+ cnfa->states[s->no] = ca;
+ ca->co = 0; /* clear and skip flags "arc" */
+ ca++;
+ first = ca;
+ for (a = s->outs; a != NULL; a = a->outchain)
+ switch (a->type) {
+ case PLAIN:
+ ca->co = a->co;
+ ca->to = a->to->no;
+ ca++;
+ break;
+ case LACON:
+ assert(s->no != cnfa->pre);
+ ca->co = (color)(cnfa->ncolors + a->co);
+ ca->to = a->to->no;
+ ca++;
+ cnfa->flags |= HASLACONS;
+ break;
+ default:
+ assert(NOTREACHED);
+ break;
+ }
+ carcsort(first, ca-1);
+ ca->co = COLORLESS;
+ ca->to = 0;
+ ca++;
+ }
+ assert(ca == &cnfa->arcs[narcs]);
+ assert(cnfa->nstates != 0);
+
+ /* mark no-progress states */
+ for (a = nfa->pre->outs; a != NULL; a = a->outchain)
+ cnfa->states[a->to->no]->co = 1;
+ cnfa->states[nfa->pre->no]->co = 1;
+}
+
+/*
+ * carcsort - sort compacted-NFA arcs by color
+ *
+ * Really dumb algorithm, but if the list is long enough for that to matter,
+ * you're in real trouble anyway.
+ */
+static void
+carcsort(struct carc *first,
+ struct carc *last)
+{
+ struct carc *p;
+ struct carc *q;
+ struct carc tmp;
+
+ if (last - first <= 1)
+ return;
+
+ for (p = first; p <= last; p++)
+ for (q = p; q <= last; q++)
+ if (p->co > q->co ||
+ (p->co == q->co && p->to > q->to)) {
+ assert(p != q);
+ tmp = *p;
+ *p = *q;
+ *q = tmp;
+ }
+}
+
+/*
+ * freecnfa - free a compacted NFA
+ */
+static void
+freecnfa(struct cnfa *cnfa)
+{
+ assert(cnfa->nstates != 0); /* not empty already */
+ cnfa->nstates = 0;
+ FREE(cnfa->states);
+ FREE(cnfa->arcs);
+}
+
+/*
+ * dumpnfa - dump an NFA in human-readable form
+ */
+static void
+dumpnfa(struct nfa *nfa,
+ FILE *f)
+{
+#ifdef REG_DEBUG
+ struct state *s;
+
+ fprintf(f, "pre %d, post %d", nfa->pre->no, nfa->post->no);
+ if (nfa->bos[0] != COLORLESS)
+ fprintf(f, ", bos [%ld]", (long)nfa->bos[0]);
+ if (nfa->bos[1] != COLORLESS)
+ fprintf(f, ", bol [%ld]", (long)nfa->bos[1]);
+ if (nfa->eos[0] != COLORLESS)
+ fprintf(f, ", eos [%ld]", (long)nfa->eos[0]);
+ if (nfa->eos[1] != COLORLESS)
+ fprintf(f, ", eol [%ld]", (long)nfa->eos[1]);
+ fprintf(f, "\n");
+ for (s = nfa->states; s != NULL; s = s->next)
+ dumpstate(s, f);
+ if (nfa->parent == NULL)
+ dumpcolors(nfa->cm, f);
+ fflush(f);
+#endif
+}
+
+#ifdef REG_DEBUG /* subordinates of dumpnfa */
+
+/*
+ * dumpstate - dump an NFA state in human-readable form
+ */
+static void
+dumpstate(struct state *s,
+ FILE *f)
+{
+ struct arc *a;
+
+ fprintf(f, "%d%s%c", s->no, (s->tmp != NULL) ? "T" : "",
+ (s->flag) ? s->flag : '.');
+ if (s->prev != NULL && s->prev->next != s)
+ fprintf(f, "\tstate chain bad\n");
+ if (s->nouts == 0)
+ fprintf(f, "\tno out arcs\n");
+ else
+ dumparcs(s, f);
+ fflush(f);
+ for (a = s->ins; a != NULL; a = a->inchain) {
+ if (a->to != s)
+ fprintf(f, "\tlink from %d to %d on %d's in-chain\n",
+ a->from->no, a->to->no, s->no);
+ }
+}
+
+/*
+ * dumparcs - dump out-arcs in human-readable form
+ */
+static void
+dumparcs(struct state *s,
+ FILE *f)
+{
+ int pos;
+
+ assert(s->nouts > 0);
+ /* printing arcs in reverse order is usually clearer */
+ pos = dumprarcs(s->outs, s, f, 1);
+ if (pos != 1)
+ fprintf(f, "\n");
+}
+
+/*
+ * dumprarcs - dump remaining outarcs, recursively, in reverse order
+ */
+static int /* resulting print position */
+dumprarcs(struct arc *a,
+ struct state *s,
+ FILE *f,
+ int pos) /* initial print position */
+{
+ if (a->outchain != NULL)
+ pos = dumprarcs(a->outchain, s, f, pos);
+ dumparc(a, s, f);
+ if (pos == 5) {
+ fprintf(f, "\n");
+ pos = 1;
+ } else
+ pos++;
+ return pos;
+}
+
+/*
+ * dumparc - dump one outarc in readable form, including prefixing tab
+ */
+static void
+dumparc(struct arc *a,
+ struct state *s,
+ FILE *f)
+{
+ struct arc *aa;
+ struct arcbatch *ab;
+
+ fprintf(f, "\t");
+ switch (a->type) {
+ case PLAIN:
+ fprintf(f, "[%ld]", (long)a->co);
+ break;
+ case AHEAD:
+ fprintf(f, ">%ld>", (long)a->co);
+ break;
+ case BEHIND:
+ fprintf(f, "<%ld<", (long)a->co);
+ break;
+ case LACON:
+ fprintf(f, ":%ld:", (long)a->co);
+ break;
+ case '^':
+ case '$':
+ fprintf(f, "%c%d", a->type, (int)a->co);
+ break;
+ case EMPTY:
+ break;
+ default:
+ fprintf(f, "0x%x/0%lo", a->type, (long)a->co);
+ break;
+ }
+ if (a->from != s)
+ fprintf(f, "?%d?", a->from->no);
+ for (ab = &a->from->oas; ab != NULL; ab = ab->next) {
+ for (aa = &ab->a[0]; aa < &ab->a[ABSIZE]; aa++)
+ if (aa == a)
+ break; /* NOTE BREAK OUT */
+ if (aa < &ab->a[ABSIZE]) /* propagate break */
+ break; /* NOTE BREAK OUT */
+ }
+ if (ab == NULL)
+ fprintf(f, "?!?"); /* not in allocated space */
+ fprintf(f, "->");
+ if (a->to == NULL) {
+ fprintf(f, "NULL");
+ return;
+ }
+ fprintf(f, "%d", a->to->no);
+ for (aa = a->to->ins; aa != NULL; aa = aa->inchain)
+ if (aa == a)
+ break; /* NOTE BREAK OUT */
+ if (aa == NULL)
+ fprintf(f, "?!?"); /* missing from in-chain */
+}
+
+#endif /* REG_DEBUG */
+
+/*
+ * dumpcnfa - dump a compacted NFA in human-readable form
+ */
+#ifdef REG_DEBUG
+static void
+dumpcnfa(struct cnfa *cnfa,
+ FILE *f)
+{
+ int st;
+
+ fprintf(f, "pre %d, post %d", cnfa->pre, cnfa->post);
+ if (cnfa->bos[0] != COLORLESS)
+ fprintf(f, ", bos [%ld]", (long)cnfa->bos[0]);
+ if (cnfa->bos[1] != COLORLESS)
+ fprintf(f, ", bol [%ld]", (long)cnfa->bos[1]);
+ if (cnfa->eos[0] != COLORLESS)
+ fprintf(f, ", eos [%ld]", (long)cnfa->eos[0]);
+ if (cnfa->eos[1] != COLORLESS)
+ fprintf(f, ", eol [%ld]", (long)cnfa->eos[1]);
+ if (cnfa->flags&HASLACONS)
+ fprintf(f, ", haslacons");
+ fprintf(f, "\n");
+ for (st = 0; st < cnfa->nstates; st++)
+ dumpcstate(st, cnfa->states[st], cnfa, f);
+ fflush(f);
+}
+#endif
+
+#ifdef REG_DEBUG /* subordinates of dumpcnfa */
+
+/*
+ * dumpcstate - dump a compacted-NFA state in human-readable form
+ */
+static void
+dumpcstate(int st,
+ struct carc *ca,
+ struct cnfa *cnfa,
+ FILE *f)
+{
+ int i;
+ int pos;
+
+ fprintf(f, "%d%s", st, (ca[0].co) ? ":" : ".");
+ pos = 1;
+ for (i = 1; ca[i].co != COLORLESS; i++) {
+ if (ca[i].co < cnfa->ncolors)
+ fprintf(f, "\t[%ld]->%d", (long)ca[i].co, ca[i].to);
+ else
+ fprintf(f, "\t:%ld:->%d", (long)ca[i].co-cnfa->ncolors,
+ ca[i].to);
+ if (pos == 5) {
+ fprintf(f, "\n");
+ pos = 1;
+ } else
+ pos++;
+ }
+ if (i == 1 || pos != 1)
+ fprintf(f, "\n");
+ fflush(f);
+}
+
+#endif /* REG_DEBUG */
-/*-
- * Copyright (c) 1992, 1993, 1994 Henry Spencer.
- * Copyright (c) 1992, 1993, 1994
- * The Regents of the University of California. All rights reserved.
- *
- * This code is derived from software contributed to Berkeley by
- * Henry Spencer.
+/*
+ * re_*comp and friends - compile REs
+ * This file #includes several others (see the bottom).
*
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- * 1. Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * 3. All advertising materials mentioning features or use of this software
- * must display the following acknowledgement:
- * This product includes software developed by the University of
- * California, Berkeley and its contributors.
- * 4. Neither the name of the University nor the names of its contributors
- * may be used to endorse or promote products derived from this software
- * without specific prior written permission.
+ * Copyright (c) 1998, 1999 Henry Spencer. All rights reserved.
+ *
+ * Development of this software was funded, in part, by Cray Research Inc.,
+ * UUNET Communications Services Inc., Sun Microsystems Inc., and Scriptics
+ * Corporation, none of whom are responsible for the results. The author
+ * thanks all of them.
+ *
+ * Redistribution and use in source and binary forms -- with or without
+ * modification -- are permitted for any purpose, provided that
+ * redistributions in source form retain this entire copyright notice and
+ * indicate the origin and nature of any modifications.
+ *
+ * I'd appreciate being given credit for this package in the documentation
+ * of software which uses it, but that is not a requirement.
+ *
+ * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * HENRY SPENCER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
+ * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+ * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
+ * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
+ * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
- * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
- * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
- * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
- * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
- * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
+ * $Header: /cvsroot/pgsql/src/backend/regex/regcomp.c,v 1.36 2003/02/05 17:41:33 tgl Exp $
*
- * @(#)regcomp.c 8.5 (Berkeley) 3/20/94
*/
-#include "postgres.h"
-
-#include <ctype.h>
-#include <limits.h>
-#include <assert.h>
-
-#include "regex/regex.h"
-#include "regex/utils.h"
-#include "regex/regex2.h"
-#include "regex/cname.h"
-
-struct cclass
-{
- char *name;
- char *chars;
- char *multis;
-};
-static struct cclass *cclasses = NULL;
-static struct cclass *cclass_init(void);
+#include "regex/regguts.h"
/*
- * parse structure, passed up and down to avoid global variables and
- * other clumsinesses
+ * forward declarations, up here so forward datatypes etc. are defined early
*/
-struct parse
-{
- pg_wchar *next; /* next character in RE */
- pg_wchar *end; /* end of string (-> NUL normally) */
- int error; /* has an error been seen? */
- sop *strip; /* malloced strip */
- sopno ssize; /* malloced strip size (allocated) */
- sopno slen; /* malloced strip length (used) */
- int ncsalloc; /* number of csets allocated */
- struct re_guts *g;
-#define NPAREN 10 /* we need to remember () 1-9 for back
- * refs */
- sopno pbegin[NPAREN]; /* -> ( ([0] unused) */
- sopno pend[NPAREN]; /* -> ) ([0] unused) */
+/* === regcomp.c === */
+static void moresubs (struct vars *, int);
+static int freev (struct vars *, int);
+static void makesearch (struct vars *, struct nfa *);
+static struct subre *parse (struct vars *, int, int, struct state *, struct state *);
+static struct subre *parsebranch (struct vars *, int, int, struct state *, struct state *, int);
+static void parseqatom (struct vars *, int, int, struct state *, struct state *, struct subre *);
+static void nonword (struct vars *, int, struct state *, struct state *);
+static void word (struct vars *, int, struct state *, struct state *);
+static int scannum (struct vars *);
+static void repeat (struct vars *, struct state *, struct state *, int, int);
+static void bracket (struct vars *, struct state *, struct state *);
+static void cbracket (struct vars *, struct state *, struct state *);
+static void brackpart (struct vars *, struct state *, struct state *);
+static chr *scanplain (struct vars *);
+static void leaders (struct vars *, struct cvec *);
+static void onechr (struct vars *, chr, struct state *, struct state *);
+static void dovec (struct vars *, struct cvec *, struct state *, struct state *);
+static celt nextleader (struct vars *, chr, chr);
+static void wordchrs (struct vars *);
+static struct subre *subre (struct vars *, int, int, struct state *, struct state *);
+static void freesubre (struct vars *, struct subre *);
+static void freesrnode (struct vars *, struct subre *);
+static void optst (struct vars *, struct subre *);
+static int numst (struct subre *, int);
+static void markst (struct subre *);
+static void cleanst (struct vars *);
+static long nfatree (struct vars *, struct subre *, FILE *);
+static long nfanode (struct vars *, struct subre *, FILE *);
+static int newlacon (struct vars *, struct state *, struct state *, int);
+static void freelacons (struct subre *, int);
+static void rfree (regex_t *);
+#ifdef REG_DEBUG
+static void dump (regex_t *, FILE *);
+static void dumpst (struct subre *, FILE *, int);
+static void stdump (struct subre *, FILE *, int);
+static char *stid (struct subre *, char *, size_t);
+#endif
+/* === regc_lex.c === */
+static void lexstart (struct vars *);
+static void prefixes (struct vars *);
+static void lexnest (struct vars *, chr *, chr *);
+static void lexword (struct vars *);
+static int next (struct vars *);
+static int lexescape (struct vars *);
+static chr lexdigits (struct vars *, int, int, int);
+static int brenext (struct vars *, chr);
+static void skip (struct vars *);
+static chr newline (void);
+static chr chrnamed (struct vars *, chr *, chr *, chr);
+/* === regc_color.c === */
+static void initcm (struct vars *, struct colormap *);
+static void freecm (struct colormap *);
+static void cmtreefree (struct colormap *, union tree *, int);
+static color setcolor (struct colormap *, chr, pcolor);
+static color maxcolor (struct colormap *);
+static color newcolor (struct colormap *);
+static void freecolor (struct colormap *, pcolor);
+static color pseudocolor (struct colormap *);
+static color subcolor (struct colormap *, chr c);
+static color newsub (struct colormap *, pcolor);
+static void subrange (struct vars *, chr, chr, struct state *, struct state *);
+static void subblock (struct vars *, chr, struct state *, struct state *);
+static void okcolors (struct nfa *, struct colormap *);
+static void colorchain (struct colormap *, struct arc *);
+static void uncolorchain (struct colormap *, struct arc *);
+static int singleton (struct colormap *, chr c);
+static void rainbow (struct nfa *, struct colormap *, int, pcolor, struct state *, struct state *);
+static void colorcomplement (struct nfa *, struct colormap *, int, struct state *, struct state *, struct state *);
+#ifdef REG_DEBUG
+static void dumpcolors (struct colormap *, FILE *);
+static void fillcheck (struct colormap *, union tree *, int, FILE *);
+static void dumpchr (chr, FILE *);
+#endif
+/* === regc_nfa.c === */
+static struct nfa *newnfa (struct vars *, struct colormap *, struct nfa *);
+static void freenfa (struct nfa *);
+static struct state *newstate (struct nfa *);
+static struct state *newfstate (struct nfa *, int flag);
+static void dropstate (struct nfa *, struct state *);
+static void freestate (struct nfa *, struct state *);
+static void destroystate (struct nfa *, struct state *);
+static void newarc (struct nfa *, int, pcolor, struct state *, struct state *);
+static struct arc *allocarc (struct nfa *, struct state *);
+static void freearc (struct nfa *, struct arc *);
+static struct arc *findarc (struct state *, int, pcolor);
+static void cparc (struct nfa *, struct arc *, struct state *, struct state *);
+static void moveins (struct nfa *, struct state *, struct state *);
+static void copyins (struct nfa *, struct state *, struct state *);
+static void moveouts (struct nfa *, struct state *, struct state *);
+static void copyouts (struct nfa *, struct state *, struct state *);
+static void cloneouts (struct nfa *, struct state *, struct state *, struct state *, int);
+static void delsub (struct nfa *, struct state *, struct state *);
+static void deltraverse (struct nfa *, struct state *, struct state *);
+static void dupnfa (struct nfa *, struct state *, struct state *, struct state *, struct state *);
+static void duptraverse (struct nfa *, struct state *, struct state *);
+static void cleartraverse (struct nfa *, struct state *);
+static void specialcolors (struct nfa *);
+static long optimize (struct nfa *, FILE *);
+static void pullback (struct nfa *, FILE *);
+static int pull (struct nfa *, struct arc *);
+static void pushfwd (struct nfa *, FILE *);
+static int push (struct nfa *, struct arc *);
+#define INCOMPATIBLE 1 /* destroys arc */
+#define SATISFIED 2 /* constraint satisfied */
+#define COMPATIBLE 3 /* compatible but not satisfied yet */
+static int combine (struct arc *, struct arc *);
+static void fixempties (struct nfa *, FILE *);
+static int unempty (struct nfa *, struct arc *);
+static void cleanup (struct nfa *);
+static void markreachable (struct nfa *, struct state *, struct state *, struct state *);
+static void markcanreach (struct nfa *, struct state *, struct state *, struct state *);
+static long analyze (struct nfa *);
+static void compact (struct nfa *, struct cnfa *);
+static void carcsort (struct carc *, struct carc *);
+static void freecnfa (struct cnfa *);
+static void dumpnfa (struct nfa *, FILE *);
+#ifdef REG_DEBUG
+static void dumpstate (struct state *, FILE *);
+static void dumparcs (struct state *, FILE *);
+static int dumprarcs (struct arc *, struct state *, FILE *, int);
+static void dumparc (struct arc *, struct state *, FILE *);
+static void dumpcnfa (struct cnfa *, FILE *);
+static void dumpcstate (int, struct carc *, struct cnfa *, FILE *);
+#endif
+/* === regc_cvec.c === */
+static struct cvec *newcvec (int, int, int);
+static struct cvec *clearcvec (struct cvec *);
+static void addchr (struct cvec *, chr);
+static void addrange (struct cvec *, chr, chr);
+static void addmcce (struct cvec *, chr *, chr *);
+static int haschr (struct cvec *, chr);
+static struct cvec *getcvec (struct vars *, int, int, int);
+static void freecvec (struct cvec *);
+/* === regc_locale.c === */
+static int pg_isdigit(pg_wchar c);
+static int pg_isalpha(pg_wchar c);
+static int pg_isalnum(pg_wchar c);
+static int pg_isupper(pg_wchar c);
+static int pg_islower(pg_wchar c);
+static int pg_isgraph(pg_wchar c);
+static int pg_ispunct(pg_wchar c);
+static int pg_isspace(pg_wchar c);
+static pg_wchar pg_toupper(pg_wchar c);
+static pg_wchar pg_tolower(pg_wchar c);
+static int nmcces (struct vars *);
+static int nleaders (struct vars *);
+static struct cvec *allmcces (struct vars *, struct cvec *);
+static celt element (struct vars *, chr *, chr *);
+static struct cvec *range (struct vars *, celt, celt, int);
+static int before (celt, celt);
+static struct cvec *eclass (struct vars *, celt, int);
+static struct cvec *cclass (struct vars *, chr *, chr *, int);
+static struct cvec *allcases (struct vars *, chr);
+static int cmp (const chr *, const chr *, size_t);
+static int casecmp (const chr *, const chr *, size_t);
+
+
+/* internal variables, bundled for easy passing around */
+struct vars {
+ regex_t *re;
+ chr *now; /* scan pointer into string */
+ chr *stop; /* end of string */
+ chr *savenow; /* saved now and stop for "subroutine call" */
+ chr *savestop;
+ int err; /* error code (0 if none) */
+ int cflags; /* copy of compile flags */
+ int lasttype; /* type of previous token */
+ int nexttype; /* type of next token */
+ chr nextvalue; /* value (if any) of next token */
+ int lexcon; /* lexical context type (see lex.c) */
+ int nsubexp; /* subexpression count */
+ struct subre **subs; /* subRE pointer vector */
+ size_t nsubs; /* length of vector */
+ struct subre *sub10[10]; /* initial vector, enough for most */
+ struct nfa *nfa; /* the NFA */
+ struct colormap *cm; /* character color map */
+ color nlcolor; /* color of newline */
+ struct state *wordchrs; /* state in nfa holding word-char outarcs */
+ struct subre *tree; /* subexpression tree */
+ struct subre *treechain; /* all tree nodes allocated */
+ struct subre *treefree; /* any free tree nodes */
+ int ntree; /* number of tree nodes */
+ struct cvec *cv; /* interface cvec */
+ struct cvec *cv2; /* utility cvec */
+ struct cvec *mcces; /* collating-element information */
+# define ISCELEADER(v,c) (v->mcces != NULL && haschr(v->mcces, (c)))
+ struct state *mccepbegin; /* in nfa, start of MCCE prototypes */
+ struct state *mccepend; /* in nfa, end of MCCE prototypes */
+ struct subre *lacons; /* lookahead-constraint vector */
+ int nlacons; /* size of lacons */
};
-static void p_ere(struct parse * p, int stop);
-static void p_ere_exp(struct parse * p);
-static void p_str(struct parse * p);
-static void p_bre(struct parse * p, int end1, int end2);
-static int p_simp_re(struct parse * p, int starordinary);
-static int p_count(struct parse * p);
-static void p_bracket(struct parse * p);
-static void p_b_term(struct parse * p, cset *cs);
-static void p_b_cclass(struct parse * p, cset *cs);
-static void p_b_eclass(struct parse * p, cset *cs);
-static pg_wchar p_b_symbol(struct parse * p);
-static char p_b_coll_elem(struct parse * p, int endc);
-static unsigned char othercase(int ch);
-static void bothcases(struct parse * p, int ch);
-static void ordinary(struct parse * p, int ch);
-static void nonnewline(struct parse * p);
-static void repeat(struct parse * p, sopno start, int from, int to);
-static int seterr(struct parse * p, int e);
-static cset *allocset(struct parse * p);
-static void freeset(struct parse * p, cset *cs);
-static int freezeset(struct parse * p, cset *cs);
-static int firstch(struct parse * p, cset *cs);
-static int nch(struct parse * p, cset *cs);
-static void mcadd(struct parse * p, cset *cs, char *cp);
-static void mcinvert(struct parse * p, cset *cs);
-static void mccase(struct parse * p, cset *cs);
-static int isinsets(struct re_guts * g, int c);
-static int samesets(struct re_guts * g, int c1, int c2);
-static void categorize(struct parse * p, struct re_guts * g);
-static sopno dupl(struct parse * p, sopno start, sopno finish);
-static void doemit(struct parse * p, sop op, size_t opnd);
-static void doinsert(struct parse * p, sop op, size_t opnd, sopno pos);
-static void dofwd(struct parse * p, sopno pos, sop value);
-static void enlarge(struct parse * p, sopno size);
-static void stripsnug(struct parse * p, struct re_guts * g);
-static void findmust(struct parse * p, struct re_guts * g);
-static sopno pluscount(struct parse * p, struct re_guts * g);
-static int pg_isdigit(int c);
-static int pg_isalpha(int c);
-static int pg_isalnum(int c);
-static int pg_isupper(int c);
-static int pg_islower(int c);
-static int pg_iscntrl(int c);
-static int pg_isgraph(int c);
-static int pg_isprint(int c);
-static int pg_ispunct(int c);
-
-static pg_wchar nuls[10]; /* place to point scanner in event of
- * error */
+/* parsing macros; most know that `v' is the struct vars pointer */
+#define NEXT() (next(v)) /* advance by one token */
+#define SEE(t) (v->nexttype == (t)) /* is next token this? */
+#define EAT(t) (SEE(t) && next(v)) /* if next is this, swallow it */
+#define VISERR(vv) ((vv)->err != 0) /* have we seen an error yet? */
+#define ISERR() VISERR(v)
+#define VERR(vv,e) ((vv)->nexttype = EOS, ((vv)->err) ? (vv)->err :\
+ ((vv)->err = (e)))
+#define ERR(e) VERR(v, e) /* record an error */
+#define NOERR() {if (ISERR()) return;} /* if error seen, return */
+#define NOERRN() {if (ISERR()) return NULL;} /* NOERR with retval */
+#define NOERRZ() {if (ISERR()) return 0;} /* NOERR with retval */
+#define INSIST(c, e) ((c) ? 0 : ERR(e)) /* if condition false, error */
+#define NOTE(b) (v->re->re_info |= (b)) /* note visible condition */
+#define