Suffix Trees

最新推荐文章于 2020-08-29 23:44:08 发布

karlajax

最新推荐文章于 2020-08-29 23:44:08 发布

阅读量1.1k

点赞数

文章标签： tree haskell algorithm character pair reference

Suffix Trees

LA home
Computing
Algorithms
glossary
Binary Trees
  Search T'
  Tries
  PATRICIA
  Suffix Trees
Tables
  Suffix T.
   Woolloomooloo

also see
Strings
  Suffix array

e.g. Given the string `mississippi',
`miss' is a prefix, `ippi' is a suffix, & `issi' is a substring. Note that a substring is a prefix of a suffix, and a suffix of a prefix.

A Suffix Tree is a data-structure that allows many problems on strings (sequences of characters) to be solved quickly. If txt=t₁t₂...t_i...t_n is a string, then T_i=t_it_i+1...t_n is the suffix of txt that starts at position i, e.g.

T₁  = mississippi = txt
T₂  = ississippi
T₃  = ssissippi
T₄  = sissippi
T₅  = issippi
T₆  = ssippi
T₇  = sippi
T₈  = ippi
T₉  = ppi
T₁₀ = pi
T₁₁ = i
T₁₂ =               (empty)

The suffix tree for `txt' is a Trie-like or PATRICIA-like data structure that represents the suffixes of txt.

A given suffix tree can be used to search for a substring, pat[1..m] in O(m) time. There are n(n+1)/2 substrings in txt[1..n] so it is rather surprising that a suffix tree can be built in O(n) time. Adding just one character to txt increases the number of substrings by n+1, but they are not independent. Weiner (1973) gave the first algorithm and McCreight (1976) gave a more readable account for constructing the suffix tree while processing txt from right to left. Only much later did Ukkonen (1992, 1995) give a left-to-right on-line algorithm, i.e. an algorithm that maintains a suffix tree for txt[1..i] at each step as i is increased from 1 to n.

If the non-empty suffixes are sorted:

T₁₁ = i
T₈  = ippi
T₅  = issippi
T₂  = ississippi
T₁  = mississippi
T₁₀ = pi
T₉  = ppi
T₇  = sippi
T₄  = sissippi
T₆  = ssippi
T₃  = ssissippi

it becomes obvious that some of them (may) share common prefixes. Here there are substrings starting with `i', `m', `p' and `s', but all of those starting `is', in fact start `issi'. Two or more common prefixes share a common path from the root of the suffix tree (as in a PATRICIA tree). Now, a search (sub)string pat must be a prefix of a suffix of txt, if it occurs in txt.

           tree                       substrings

tree-->|---mississippi                m .. mississippi
       |
       |---i-->|---ssi-->|---ssippi   i .. ississippi
       |       |         |
       |       |         |---ppi      issip,issipp,issippi
       |       |
       |       |---ppi                ip, ipp, ippi
       |
       |---s-->|---si-->|---ssippi    s .. ssissippi
       |       |        |
       |       |        |---ppi       ssip, ssipp, ssippi
       |       |
       |       |---i-->|---ssippi     si .. sissippi
       |               |
       |               |---ppi        sip, sipp, sippi
       |
       |---p-->|---pi                 p, pp, ppi
               |
               |---i                  p, pi

--- Suffix Tree for "mississippi" 
           ---

Each edge (arc) of the suffix tree is labelled with a substring of txt which is implemented by pointers to the start and end of the substring, e.g. `ssi' by <3,5>. One of the observation in Ukkonen's algorithm is that an edge, <i,n>, leading to a leaf can be implemented by <i,∞> where `∞', i.e. infinity, means `to the end of the string'.

Suffix Tree Demonstration

Change the Text txt=... in the HTML FORM below, and click on `go'; experiment with different text strings:

NB. If the string is "short", a simple sort routine is run first to sort the suffices the slow way for comparison with the tree; this is not done if the string is "long".

If the termination of txt is important, this can be indicated by a special terminating character often denoted by `$' in papers on strings (~zero char in C/Unix).

Building a Suffix Tree, (a) Slowly

We show how a suffix tree might be built "by hand". Three dots, `...', are used to show the current end of any suffix that will grow as more characters are processed. Starting with the empty suffix tree, consider the string `m':

           tree₁

tree-->----m...

Adding the second character to get `mi' there are now suffixes `mi' and `i':

           tree₂

tree-->|---mi...
       |
       |---i...

Next `mis'

           tree₃

tree-->|---mis...
       |
       |---is...
       |
       |---s...

There is no need to add any more splits for `miss' because `s' is part of `ss'.

           tree₄

tree-->|---miss...
       |
       |---iss...
       |
       |---ss...

However, with `missi' there must be a split because one `s' is followed by `i', the other by `s'

           tree₅

tree-->|---missi...
       |
       |---issi...
       |
       |---s-->|---si...
               |
               |---i...

The 6th character, `s', brings us to `missis' and no split because both `i's are followed by `s's.

           tree₆

tree-->|---missis...
       |
       |---issis...
       |
       |---s-->|---sis...
               |
               |---is...

`mississ'

           tree₇

tree-->|---mississ...
       |
       |---ississ...
       |
       |---s-->|---siss...
               |
               |---iss...

`mississi'

           tree₈

tree-->|---mississi...
       |
       |---ississi...
       |
       |---s-->|---sissi...
               |
               |---issi...

A lot suddenly happens for `mississip', because it brings the first `p', and causes the third `i' to be followed by `p' where the other two are followed by `ssi'. Consequently one of the `ssi' is followed by `p', the other by `ssip', ditto `si'.

           tree₉

tree-->|---mississip...
       |
       |---i-->|---ssi-->|---ssip...
       |       |         |
       |       |         |---p...
       |       |
       |       |---p...
       |
       |---s-->|---si-->|---ssip...
       |       |        |
       |       |        |---p...
       |       |
       |       |---i-->|---ssip...
       |               |
       |               |---p...
       |
       |---p...

By comparison `mississipp' is very quiet

           tree₁₀

tree-->|---mississipp...
       |
       |---i-->|---ssi-->|---ssipp...
       |       |         |
       |       |         |---pp...
       |       |
       |       |---pp...
       |
       |---s-->|---si-->|---ssipp...
       |       |        |
       |       |        |---pp...
       |       |
       |       |---i-->|---ssipp...
       |               |
       |               |---pp...
       |
       |---pp...

`mississippi' is an anti-climax

           tree₁₁

tree-->|---mississippi
       |
       |---i-->|---ssi-->|---ssippi
       |       |         |
       |       |         |---ppi
       |       |
       |       |---ppi
       |
       |---s-->|---si-->|---ssippi
       |       |        |
       |       |        |---ppi
       |       |
       |       |---i-->|---ssippi
       |               |
       |               |---ppi
       |
       |---p-->|---pi
               |
               |---i

and we are done. The challenge, to a computer scienctist, is to make sure tree_i is updated to tree_i+1 efficiently. This can be done (Ukkonen 1992, 1995) so that tree_n can be built, starting from tree₀, in O(n)-time overall.

(b) Faster

The following terminology is adapted from Ukkonen (1995).

If `x' is a substring of txt then `x' represents the state (i.e. location) in the suffix-tree found by tracing out the characters of x from the root. Note that x might be part-way along an edge of the tree.
A vertex (node) of the suffix-tree is called an explicit state.
A substring x=txt[L..R] can be represented by (L,R).
If `v' is a vertex of the suffix-tree, the pair `(v,x)', equivalently (v,(L,R)), represents the state (location) in the suffix-tree found by tracing out the characters of x from v.
(v,x) is canonical if v is the last explit state on the path from v to (v,x). NB. (v,empty) is canonical.
A special vertex called `bottom' is added and is denoted _|_.

The transition function, g( ), is defined as follows:

g(_|_, a) = root, for all characters `a'.

g( x, a) = y where y=xa, for character `a'.

f( ):

f(root)=_|_

f( x)= y, if x~=empty and x=ay

The suffix function f'( ) is defined as follows:

f'(root)=_|_.

If vertex v= x where x~=empty then f'(v)= y where x=ay for some character `a' and substring y (possibly empty).

The boundary path s ₁, s ₂, ..., s _i, s _i+1 of suffix-tree _i-1:

s ₁= (1,i-1), i.e. the state corresponding to txt[1..i-1]

s ₂= (2,i-1)

...

s _i=root

s _i+1=_|_

The active point is the first s _j on the boundary path that is not a leaf, and the end-point is the first s _j' that has a txt[i]-transition.

When tree_i-1 is expanded into tree_i, character txt[i] must be dealt with. This is done during a traversal of the boundary path. Any state on the boundary path before s_j is a leaf and could be extended by adding txt[i] to the incoming arc, but this can be done for free by representing arcs to leaves by (L,∞) where `∞' is `infinity'. So it it is only necessary to examine states from the active point s_j and prior to the end-point s_j' .

"[states from s_j and before s_j' create entirely new branches that start from states s_h, j<=h<j'. ... They are found along the boundary path of [tree_i-1] using reference pairs and suffix links." - Ukkonen (1995) .


// almost 
           
            JavaScript (try view-source)

function upDate(s, k, i)
// (s, (k, i-1)) is the canonical reference pair for the active point
 { var oldr = root;
   var (endPoint, r) = test_and_split(s, k, i-1, Txt.charAt(i));

   while (!endPoint)
    { r.addTransition(i, infinity, new State());
      if (oldr != root) oldr.sLink = r; // build suffix-link active-path

      oldr = r;
      var (s,k) = canonize(s.sLink, k, i-1)
      (endPoint, r) = test_and_split(s, k, i-1, Txt.charAt(i))
    }

   if(oldr != root) oldr.sLink = s;

   return new pair(s, k);
 }//upDate

Note that r.addTransition(...) adds an edge from state r, labelling the edge with a substring. New txt[i]-transitions must be "open" transitions of the form (L,∞).

Where necessary, test_and_split(...) replaces edges s--->s1 with s--->r--->s1 for a new node r. This makes r=(s,(k,p)) explicit.

function test_and_split(s, k, p, t)
 { if(k<=p)
    { // find the t_k transition g'(s,(k',p'))=s' from s
      // k1 is k'  p1 is p' in Ukkonen '95
      var ((k1,p1), s1)  = s[Txt.charAt(k)];

      if (t == Txt.charAt(k1 + p - k + 1))
         return new pair(true, s);
      else
       { var r = new State()
         s.addTransition(k1, k1+p-k,   r);     // s---->r---->s1
         r.addTransition(    k1+p-k+1, p1, s1);
         return new pair(false, r)
       }
    }
   else // k > p;  ? is there a t-transition from s ?
      return new pair(s[t] != null, s);
 }//test_and_split

Canonize(...) takes (s,w)=(s,(k,p)) and steps over intermediate nodes by spelling out the characters of w=txt[k..p] for as far as possible.

function canonize(s, k, p)    // s--->...
 { if(p < k) return new pair (s, k);

   // find the t_k transition g'(s,(k',p'))=s' from s
   // k1 is k',  p1 is p' in Ukk' '95
   var ((k1,p1), s1) = s[Txt.charAt(k)];     // s--(k1,p1)-->s1

   while(p1-k1 <= p-k)                       // s--(k1,p1)-->s1--->...
    { k += p1 - k1 + 1;  // remove |(k1,p1)| chars from front of (k,p)
      s = s1;
      if(k <= p)
       { ((k1,p1), s1) = s[Txt.charAt(k)];   // s--(k1,p1)-->s1
       }
    }
   return new pair(s, k);
 }//canonize

The main controlling routine repeatedly takes the next character, updates the sites on the active path and finds and canonizes the new active point:

function ukkonen95()// construct suffix tree for Txt[0..N-1]
 { var s, k, i;
   var bt;

   root = new State();
   bt = new State();                            // bt (bottom or _|_)

   // Want to create transitions for all possible chars
   // from bt to root
   for (i=0; i < Txt.length; i++)
      bt.addTransition(i,i, root);

   root.sLink = bt;
   s=root; k=0;    // NB. k=0, unlike Ukkonen our strings are 0 based

   for(i=0; i < Txt.length; i++)
    { var (s,k) = upDate(s, k, i);   // follow path from active-point
      (s,k) = canonize(s, k, i);
    }
 }//ukkonen95

It relies upon the fact ( lemma 2 Ukkonen (1995) ) that if (s,(k,i-1)) is a reference pair for the end-point, s_j' , of tree_i-1 then (s,(k,i)) is a reference pair for the active point of tree_i.

Suffix Tree Applications

Suffix Trees can be used to solve a large number of string problems that occur in text-editing, free-text search, computational biology, and other application areas. Some examples are given below.

String Search

Searching for a substring, pat[1..m], in txt[1..n], can be solved in O(m) time (after the suffix tree for txt has been built in O(n) time).

Longest Repeated Substring

Add a special ``end of string'' character, e.g. `$', to txt[1..n] and build a suffix tree; the longest repeated substring of txt[1..n] is indicated by the deepest fork node in the suffix tree, where depth is measured by the number of characters traversed from the root, i.e. `issi' in the case of `mississippi'. The longest repeated substring can be found in O(n) time using a suffix tree.

Longest Common Substring

The longest common substring of two strings, txt₁ and txt₂, can be found by building a generalized suffix tree for txt₁ and txt₂: Each node is marked to indicate if it represents a suffix of txt₁ or txt₂ or both. The deepest node marked for both txt₁ and txt₂ represents the longest common substring.

Equivalently, one can build a (basic) suffix tree for the string txt₁$txt₂#, where `$' is a special terminator for txt₁ and `#' is a special terminator for txt₂. The longest common substring is indicated by the deepest fork node that has both `...$...' and `...#...' (no $) beneath it.
(Try it using the HTML FORM above.)

Note that the `longest common substring problem' is different to the `longest common subsequence problem' which is closely related to the `edit-distance problem': An instance of a subsequence can have gaps where it appears in txt₁ and in txt₂, but an instance of a substring cannot have gaps.

Palindromes

A palindrome is a string, P, such that P=reverse(P). e.g. `abba'=reverse(`abba'). e.g. `ississi' is the longest palindrome in `mississippi'.

The longest palindrome of txt[1..n] can be found in O(n) time, e.g. by building the suffix tree for txt$reverse(txt)# or by building the generalized suffix tree for txt and reverse(txt).
(Try it.)

C
o
m
p
u
t
e
r
-
S
c
i
e
n
c
e

Suffix Tree Notes

Significant developments are due to Weiner (1973), McCreight (1976) and Ukkonen (1992,1995).

See also Tries and PATRICIA trees, but remember that these are for implementing lookup-tables containing many, usually short, strings rather than for finding substrings of one long text.
E. M. McCreight. A Space-Economical Suffix Tree Construction Algorithm. Jrnl. of Algorithms, 23(2) pp262-272, 1976.
Credited with making Weiner's suffix trees (1973) more efficient and (a little) more accessible. Algorithm still right-to-left.
E. Ukkonen. Constructing Suffix Trees On-Line in Linear Time. In Algorithms, Software, Architecture, J.v.Leeuwen (ed) , vol#1 of Information Processing 92 , Proc. IFIP 12th World Computer Congress, Madrid, Spain, Elsevier Sci. Publ., pp484-492, 1992.
Gives an on-line algorithm, i.e. it processes the input text incrementally from left to right and at each stage it has a suffix tree for the part of the text that has been processed so far. But see note about "inaccuracies" below!
E. Ukkonen. On-line Construction of Suffix Trees. Algorithmica, 14(3) pp249-260, 1995.
p260: "J.Karkkainen pointed out some inaccuracies in the earlier [1992] version of this work."
P. Weiner. Linear Pattern Matching Algorithms. Proc. 14th IEEE Annual Symp. on Switching and Automata Theory, pp1-11, 1973.
Extremely important paper introducing suffix trees and hence fast ways of solving many problems on strings.
Also see [String Searching].

www

The C++ Programming Language
Stroustrup

The Art of Computer Programming, Vols.1-3
Donald Knuth 7us

The Darwin Awards 4:

free:
Linux operating-sys
OpenOffice office-suite, ver. 2.4+
The GIMP ~photoshop
Firefox web browser
FlashBlock flash on/off

© L. Allison http://www.allisons.org/ll/ ( or as otherwise indicated),
Created with "vi (Linux + Solaris)", charset=iso-8859-1, fetched Tuesday, 14-Oct-2008 13:40:25 EST.

网址：http://www.allisons.org/ll/AlgDS/Tree/Suffix/