Rectangle 27 0

there's an implementation of Ukkonen's linear construction of suffix trees (plus suffix arrays, lcp array) here: http://code.google.com/p/text-indexing/ . the visualization provided along with the suffixtree.js may help

Note that link-only answers are discouraged, SO answers should be the end-point of a search for a solution (vs. yet another stopover of references, which tend to get stale over time). Please consider adding a stand-alone synopsis here, keeping the link as a reference.

hey .. you were already advised to not post link-only answers, why don't you learn?

algorithm - really hard to understand suffix tree - Stack Overflow

algorithm suffix-tree
Rectangle 27 0

Construct a suffix tree of the large string -- the construction of the tree is O(n) where n is the length of the large string.

Now you can find the location of any of the substrings in O(m) time (where m is the length of the substring) by simply following the the substring down through the tree -- the node or leaf where the substring ends will hold the key corresponding to the index into the large string.

Go through the set of substrings finding their location in the big string, keeping track of the minimum index.

I edited the above description.

can you use hashtable or dictionary instead of prefix tree?

Sure -- but the choice depends on the application. If you are doing combinations of insertions, lookups and deletions, then the prefix tree is probably the best solution. Otherwise a hashtable can provide very good results. Again, it depends on the application.

Also, tries facilitate longest-prefix matching, but hashing does not

algorithm - getting a string from a list of substrings - Stack Overflo...

string algorithm substring string-matching
Rectangle 27 0

Related Topics:

References

This article is contributed by Anurag Singh. Please write comments if you find anything incorrect, or you want to share more information about the topic discussed above

Writing code in comment?

Rectangle 27 0

Related Topics:

References

This article is contributed by Anurag Singh. Please write comments if you find anything incorrect, or you want to share more information about the topic discussed above

Writing code in comment?

Rectangle 27 0

*********************Phase 1*********************************

At the end of phase 1, remainingSuffixCount is ZERO (All suffixes are added explicitly). Figure 20 in Part 3 is the resulting tree after phase 1.

Rectangle 27 0

Related Topics:

References

This article is contributed by Anurag Singh. Please write comments if you find anything incorrect, or you want to share more information about the topic discussed above

Writing code in comment?

Rectangle 27 0

static void Main(string[] args)
    {

        //XmlDocument doc = new XmlDocument();
        //XmlElement newBook=doc.CreateElement("BookParticipant");
        //newBook.SetAttribute("Author");

        //Using Functional Construction to Create an XML Schema
        XElement xBookParticipant = new XElement("BookParticipant",
                                        new XElement("FirstName", "Joe"),
                                        new XElement("LastName", "Rattz"));
        Console.WriteLine(xBookParticipant.ToString());


        //Creates the Same XML Tree as Listing 6-1 but with Far Less Code
        XElement xBookParticipants = new XElement("BookParticipants",
                                        new XElement("BookParticipant",
                                        new XAttribute("type", "Author"),
                                        new XElement("FirstName", "Joe"),
                                        new XElement("LastName", "Rattz")),
                                        new XElement("BookParticipant",
                                        new XAttribute("type", "Editor"),
                                        new XElement("FirstName", "Ewan"),
                                        new XElement("LastName", "Buckingham")));
        Console.WriteLine(xBookParticipants.ToString());


        //-- Disadvatages of XML document
        //System.Xml.XmlElement xmlBookParticipant = new System.Xml.XmlElement("BookParticipant");
        XElement xeBookParticipant = new XElement("BookParticipant");


        XDocument xDocument = new XDocument(new XElement("BookParticipants",
                                            new XElement("BookParticipant",
                                            new XAttribute("type", "Author"),
                                            new XElement("FirstName", "Joe"),
                                            new XElement("LastName", "Rattz"))));
        Console.WriteLine(xDocument.ToString());


        //--Calling the ToString Method on an Element Produces the XML Tree
        XElement name = new XElement("Name", "Joe");
        Console.WriteLine(name.ToString());

        //--Console.WriteLine Implicitly Calling the ToString Method on an Element to Produce an XML Tree


        XElement name1 = new XElement("Person",
                                      new XElement("FirstName", "Joe"),
                                      new XElement("LastName", "Rattz"));
        Console.WriteLine(name1);

        //-- Casting an Element to Its Values Data Type Outputs the Value
        Console.WriteLine(name);
        Console.WriteLine((string)name);

        //--Different Node Value Types Retrieved via Casting to the Node Values Type
        XElement count = new XElement("Count", 12);
        Console.WriteLine(count);
        Console.WriteLine((int)count);

        XElement smoker = new XElement("Smoker", false);
        Console.WriteLine(smoker);
        Console.WriteLine((bool)smoker);

        XElement pi = new XElement("Pi", 3.1415926535);
        Console.WriteLine(pi);
        Console.WriteLine((double)pi);


        DeferredQryProblem();


        GenerateXMlFromLinqQry();



    }

    private static void DeferredQryProblem()
    {
        XDocument xDocument = new XDocument(
                    new XElement("BookParticipants",
                    new XElement("BookParticipant",
                    new XAttribute("type", "Author"),
                    new XElement("FirstName", "Joe"),
                    new XElement("LastName", "Rattz")),
                    new XElement("BookParticipant",
                    new XAttribute("type", "Editor"),
                    new XElement("FirstName", "Ewan"),
                    new XElement("LastName", "Buckingham"))));
        IEnumerable<XElement> elements =
        xDocument.Element("BookParticipants").Elements("BookParticipant");
        foreach (XElement element in elements)
        {
            Console.WriteLine("Source element: {0} : value = {1}",
            element.Name, element.Value);
        }
        foreach (XElement element in elements)
        {
            Console.WriteLine("Removing {0} = {1} ...", element.Name, element.Value);
            element.Remove();
        }
        Console.WriteLine(xDocument);


        foreach (XElement element in elements)
        {
            Console.WriteLine("Source element: {0} : value = {1}",
            element.Name, element.Value);
        }
        foreach (XElement element in elements.ToArray())
        {
            Console.WriteLine("Removing {0} = {1} ...", element.Name, element.Value);
            element.Remove();
        }
        Console.WriteLine(xDocument);
    }

    //-- Creating an Attribute and Adding It to Its Element
    private static void CreatingAttribute()
    {
        XElement xBookParticipant = new XElement("BookParticipant", new XAttribute("type", "Author"));
        Console.WriteLine(xBookParticipant);
    }

    //--Creating a Comment with Functional Construction
    private static void CreatingComment()
    {
        XElement xBookParticipant = new XElement("BookParticipant",
                                                  new XComment("This person is retired."));
        Console.WriteLine(xBookParticipant);
    }

    //--Creating a Declaration with Functional Construction
    private static void CreateXmlDeclaration()
    {
        XDocument xDocument = new XDocument(new XDeclaration("1.0", "UTF-8", "yes"),
new XElement("BookParticipant"));
        Console.WriteLine(xDocument);
    }

    private static void GenerateXMlFromLinqQry()
    {
        BookParticipant[] bookParticipants = new[] {new BookParticipant {FirstName = "Joe", LastName = "Rattz",
                                                    ParticipantType = ParticipantTypes.Author},
                                                    new BookParticipant {FirstName = "Ewan", LastName = "Buckingham",
                                                    ParticipantType = ParticipantTypes.Editor}
                                                    };
        XElement xBookParticipants =
        new XElement("BookParticipants",
                     bookParticipants.Select(p =>
                new XElement("BookParticipant",
                new XAttribute("type", p.ParticipantType),
                new XElement("FirstName", p.FirstName),
                new XElement("LastName", p.LastName))));


        Console.WriteLine(xBookParticipants);
    }

    //-- Obtaining Elements Without Reaching
    private static void WithoutReaching()
    {
        XDocument xDocument = new XDocument(new XElement("BookParticipants",
                                            new XElement("BookParticipant",
                                            new XAttribute("type", "Author"),
                                            new XElement("FirstName", "Joe"),
                                            new XElement("LastName", "Rattz")),
                                            new XElement("BookParticipant",
                                            new XAttribute("type", "Editor"),
                                            new XElement("FirstName", "Ewan"),
                                            new XElement("LastName", "Buckingham"))));
        IEnumerable<XElement> elements = xDocument.Descendants("BookParticipant");
        foreach (XElement element in elements)
        {
            Console.WriteLine("Element: {0} : value = {1}",
            element.Name, element.Value);
        }


        IEnumerable<XElement> elements1 = xDocument.Descendants("BookParticipant")
                                                   .Where(e => ((string)e.Element("FirstName")) == "Ewan");
        foreach (XElement element1 in elements1)
        {
            Console.WriteLine("Element: {0} : value = {1}",
            element1.Name, element1.Value);
        }

    }

c# - Linq to XML - update/alter the nodes of an XML Document - Stack O...

c# xml linq linq-to-xml
Rectangle 27 0

Find a copy of Gusfield's string algorithms textbook. It's got the best exposition of the suffix tree construction I've seen. The linearity is a surprising consequence of a number of optimizations of the high-level algorithm.

Is there an animated version of this ukkonen algo? Sorry i couldn't understand the constant nature of "update" function. I tried gusfield, ukkonen's paper and google too :D

I would love to watch a animation for building a generalized suffix tree in linear time. Generally text based explanation wont fit in my mind..:)

Gusfields chapter on suffix trees has this annoying feature, that he uses different strings to illustrate the different steps - so you loose the big picture.

complexity theory - Understanding Ukkonen's algorithm for suffix trees...

algorithm complexity-theory suffix-tree
Rectangle 27 0

Related Topics:

References

This article is contributed by Anurag Singh. Please write comments if you find anything incorrect, or you want to share more information about the topic discussed above

Writing code in comment?

Rectangle 27 0

Ukkonens Suffix Tree Construction Part 2

In Ukkonens Suffix Tree Construction Part 1, we have seen high level Ukkonens Algorithm. This 2nd part is continuation of Part 1. Please go through Part 1, before looking at current article.

In Suffix Tree Construction of string S of length m, there are m phases and for a phase j (1 <= j <= m), we add jth character in tree built so far and this is done through j extensions. All extensions follow one of the three extension rules (discussed in Part 1).

To do jth extension of phase i+1 (adding character S[i+1]), we first need to find end of the path from the root labelled S[j..i] in the current tree. One way is start from root and traverse the edges matching S[j..i] string. This will take O(m3) time to build the suffix tree. Using few observations and implementation tricks, it can be done in O(m) which we will see now.

Suffix links For an internal node v with path-label xA, where x denotes a single character and A denotes a (possibly empty) substring, if there is another node s(v) with path-label A, then a pointer from v to s(v) is called a suffix link. If A is empty string, suffix link from internal node will go to root node. There will not be any suffix link from root node (As its not considered as internal node).

In extension j of some phase i, if a new internal node v with path-label xA is added, then in extension j+1 in the same phase i:

In extension j+1 of same phase i, we will create a suffix link from the internal node created in jth extension to the node with path labelled A.

So in a given phase, any newly created internal node (with path-label xA) will have a suffix link from it (pointing to another node with path-label A) by the end of the next extension.

In any implicit suffix tree Ti after phase i, if internal node v has path-label xA, then there is a node s(v) in Ti with path-label A and node v will point to node s(v) using suffix link.

At any time, all internal nodes in the changing tree will have suffix links from them to another internal node (or root) except for the most recently added internal node, which will receive its suffix link by the end of the next extension.

How suffix links are used to speed up the implementation? In extension j of phase i+1, we need to find the end of the path from the root labelled S[j..i] in the current tree. One way is start from root and traverse the edges matching S[j..i] string. Suffix links provide a short cut to find end of the path.

So we can see that, to find end of path S[j..i], we need not traverse from root. We can start from the end of path S[j-1..i], walk up one edge to node v (i.e. go to parent node), follow the suffix link to s(v), then walk down the path y (which is abcd here in Figure 17). This shows the use of suffix link is an improvement over the process. Note: In the next part 3, we will introduce activePoint which will help to avoid walk up. We can directly go to node s(v) from node v.

When there is a suffix link from node v to node s(v), then if there is a path labelled with string y from node v to a leaf, then there must be a path labelled with string y from node s(v) to a leaf. In Figure 17, there is a path label abcd from node v to a leaf, then there is a path will same label abcd from node s(v) to a leaf. This fact can be used to improve the walk from s(v) to leaf along the path y. This is called skip/count trick.

Skip/Count Trick When walking down from node s(v) to leaf, instead of matching path character by character as we travel, we can directly skip to the next node if number of characters on the edge is less than the number of characters we need to travel. If number of characters on the edge is more than the number of characters we need to travel, we directly skip to the last character on that edge. If implementation is such a way that number of characters on any edge, character at a given position in string S should be obtained in constant time, then skip/count trick will do the walk down in proportional to the number of nodes on it rather than the number of characters on it.

Using suffix link along with skip/count trick, suffix tree can be built in O(m2) as there are m phases and each phase takes O(m).

Edge-label compression So far, path labels are represented as characters in string. Such a suffix tree will take O(m2) space to store the path labels. To avoid this, we can use two pair of indices (start, end) on each edge for path labels, instead of substring itself. The indices start and end tells the path label start and end position in string S. With this, suffix tree needs O(m) space.

Ukkonen's Suffix Tree Construction

There are two observations about the way extension rules interact in successive extensions and phases. These two observations lead to two more implementation tricks (first trick skip/count is seen already while walk down).

Observation 1: Rule 3 is show stopper In a phase i, there are i extensions (1 to i) to be done. When rule 3 applies in any extension j of phase i+1 (i.e. path labelled S[j..i] continues with character S[i+1]), then it will also apply in all further extensions of same phase (i.e. extensions j+1 to i+1 in phase i+1). Thats because if path labelled S[j..i] continues with character S[i+1], then path labelled S[j+1..i], S[j+2..i], S[j+3..i],, S[i..i] will also continue with character S[i+1]. Consider Figure 11, Figure12 and Figure 13 in Part 1 where Rule 3 is applied. In Figure 11, xab is added in tree and in Figure 12 (Phase 4), we add next character x. In this, 3 extensions are done (which adds 3 suffixes). Last suffix x is already present in tree. In Figure 13, we add character a in tree (Phase 5). First 3 suffixes are added in tree and last two suffixes xa and a are already present in tree. This shows that if suffix S[j..i] present in tree, then ALL the remaining suffixes S[j+1..i], S[j+2..i], S[j+3..i],, S[i..i] will also be there in tree and no work needed to add those remaining suffixes. So no more work needed to be done in any phase as soon as rule 3 applies in any extension in that phase. If a new internal node v gets created in extension j and rule 3 applies in next extension j+1, then we need to add suffix link from node v to current node (if we are on internal node) or root node. ActiveNode, which will be discussed in part 3, will help while setting suffix links.

Trick 2 Stop the processing of any phase as soon as rule 3 applies. All further extensions are already present in tree implicitly.

Observation 2: Once a leaf, always a leaf Once a leaf is created and labelled j (for suffix starting at position j in string S), then this leaf will always be a leaf in successive phases and extensions. Once a leaf is labelled as j, extension rule 1 will always apply to extension j in all successive phases. Consider Figure 9 to Figure 14 in Part 1. In Figure 10 (Phase 2), Rule 1 is applied on leaf labelled 1. After this, in all successive phases, rule 1 is always applied on this leaf. In Figure 11 (Phase 3), Rule 1 is applied on leaf labelled 2. After this, in all successive phases, rule 1 is always applied on this leaf. In Figure 12 (Phase 4), Rule 1 is applied on leaf labelled 3. After this, in all successive phases, rule 1 is always applied on this leaf.

In any phase i, there is an initial sequence of consecutive extensions where rule 1 or rule 2 are applied and then as soon as rule 3 is applied, phase i ends. Also rule 2 creates a new leaf always (and internal node sometimes). If Ji represents the last extension in phase i when rule 1 or 2 was applied (i.e after ith phase, there will be Ji leaves labelled 1, 2, 3, , Ji) , then Ji <= Ji+1 Ji will be equal to Ji+1 when there are no new leaf created in phase i+1 (i.e rule 3 is applied in Ji+1 extension) In Figure 11 (Phase 3), Rule 1 is applied in 1st two extensions and Rule 2 is applied in 3rd extension, so here J3 = 3 In Figure 12 (Phase 4), no new leaf created (Rule 1 is applied in 1st 3 extensions and then rule 3 is applied in 4th extension which ends the phase). Here J4 = 3 = J3 In Figure 13 (Phase 5), no new leaf created (Rule 1 is applied in 1st 3 extensions and then rule 3 is applied in 4th extension which ends the phase). Here J5 = 3 = J4 Ji will be less than Ji+1 when few new leaves are created in phase i+1. In Figure 14 (Phase 6), new leaf created (Rule 1 is applied in 1st 3 extensions and then rule 2 is applied in last 3 extension which ends the phase). Here J6 = 6 > J5

So we can see that in phase i+1, only rule 1 will apply in extensions 1 to Ji (which really doesnt need much work, can be done in constant time and thats the trick 3), extension Ji+1 onwards, rule 2 may apply to zero or more extensions and then finally rule 3, which ends the phase. Now edge labels are represented using two indices (start, end), for any leaf edge, end will always be equal to phase number i.e. for phase i, end = i for leaf edges, for phase i+1, end = i+1 for leaf edges.

Trick 3 In any phase i, leaf edges may look like (p, i), (q, i), (r, i), . where p, q, r are starting position of different edges and i is end position of all. Then in phase i+1, these leaf edges will look like (p, i+1), (q, i+1), (r, i+1),. This way, in each phase, end position has to be incremented in all leaf edges. For this, we need to traverse through all leaf edges and increment end position for them. To do same thing in constant time, maintain a global index e and e will be equal to phase number. So now leaf edges will look like (p, e), (q, e), (r, e).. In any phase, just increment e and extension on all leaf edges will be done. Figure 19 shows this.

So using suffix links and tricks 1, 2 and 3, a suffix tree can be built in linear time.

Tree Tm could be implicit tree if a suffix is prefix of another. So we can add a $ terminal symbol first and then run algorithm to get a true suffix tree (A true suffix tree contains all suffixes explicitly). To label each leaf with corresponding suffix starting position (all leaves are labelled as global index e), a linear time traversal can be done on tree.

At this point, we have gone through most of the things we needed to know to create suffix tree using Ukkonens algorithm. In next Part 3, we will take string S = abcabxabcd as an example and go through all the things step by step and create the tree. While building the tree, we will discuss few more implementation issues which will be addressed by ActivePoints. We will continue to discuss the algorithm in Part 4 and Part 5. Code implementation will be discussed in Part 6.

Rectangle 27 0

Ukkonens Suffix Tree Construction Part 3

Please go through Part 1 and Part 2, before looking at current article, where we have seen few basics on suffix tree, high level ukkonens algorithm, suffix link and three implementation tricks.

Here we will take string S = abcabxabcd as an example and go through all the things step by step and create the tree. We will add $ (discussed in Part 1 why we do this) so string S would be abcabxabcd$.

While building suffix tree for string S of length m:

Now we will see few observations and how to implement those.

  • At the end of any phase i, there are at most i leaf edges (if ith character is not seen so far, there will be i leaf edges, else there will be less than i leaf edges). e.g. After phases 1, 2 and 3 in our example, there are 1, 2 and 3 leaf edges respectively, but after phase 4, there are 3 leaf edges only (not 4).
  • After completing phase i, end indices of all leaf edges are i. How do we implement this in code? Do we need to iterate through all those extensions, find leaf edges by traversing from root to leaf and increment the end index? Answer is NO. For this, we will maintain a global variable (say END) and we will just increment this global variable END and all leaf edge end indices will point to this global variable. So this way, if we have j leaf edges after phase i, then in phase i+1, first j extensions (1 to j) will be done by just incrementing variable END by 1 (END will be i+1 at the point). Here we just implemented the trick 3 Once a leaf, always a leaf. This trick processes all the j leaf edges (i.e. extension 1 to j) using rule 1 in a constant time in any phase. Rule 1 will not apply to subsequent extensions in the same phase. This can be verified in the four phases we discussed above. If at all Rule 1 applies in any phase, it only applies in initial few phases continuously (say 1 to j). Rule 1 never applies later in a given phase once Rule 2 or Rule 3 is applied in that phase.

In the next phase i+1, trick 3 (Rule 1) will take care of first j-1 suffixes (the j-1 leaf edges), then extension j will start where we will add jth suffix in tree. For this, we need to find the best possible matching edge and then add new character at the end of that edge. How to find the end of best matching edge? Do we need to traverse from root node and match tree edges against the jth suffix being added character by character? This will take time and overall algorithm will not be linear. activePoint comes to the rescue here. In previous phase i, while jth extension, path traversal ended at a point (which could be an internal node or some point in the middle of an edge) where ith character being added was found in tree already and Rule 3 applied, jth extension of phase i+1 will start exactly from the same point and we start matching path against (i+1)th character. activePoint helps to avoid unnecessary path traversal from root in any extension based on the knowledge gained in traversals done in previous extension. There is no traversal needed in 1st p extensions where Rule 1 is applied. Traversal is done where Rule 2 or Rule 3 gets applied and thats where activePoint tells the starting point for traversal where we match the path against the current character being added in tree. Implementation is done in such a way that, in any extension where we need a traversal, activePoint is set to right location already (with one exception case APCFALZ discussed below) and at the end of current extension, we reset activePoint as apprppriate so that next extension (of same phase or next phase) where a traversal is required, activePoint points to the right place already.

activePoint: This could be root node, any internal node or any point in the middle of an edge. This is the point where traversal starts in any extension. For the 1st extension of phase 1, activePoint is set to root. Other extension will get activePoint set correctly by previous extension (with one exception case APCFALZ discussed below) and it is the responsibility of current extension to reset activePoint appropriately at the end, to be used in next extension where Rule 2 or Rule 3 is applied (of same or next phase). To accomplish this, we need a way to store activePoint. We will store this using three variables: activeNode, activeEdge, activeLength.activeNode: This could be root node or an internal node.activeEdge: When we are on root node or internal node and we need to walk down, we need to know which edge to choose. activeEdge will store that information. In case, activeNode itself is the point from where traversal starts, then activeEdge will be set to next character being processed in next phase.activeLength: This tells how many characters we need to walk down (on the path represented by activeEdge) from activeNode to reach the activePoint where traversal starts. In case, activeNode itself is the point from where traversal starts, then activeLength will be ZERO.(click on below image to see it clearly)

Ukkonen's Suffix Tree Construction

After phase i, if there are j leaf edges then in phase i+1, first j extensions will be done by trick 3. activePoint will be needed for the extensions from j+1 to i+1 and activePoint may or may not change between two extensions depending on the point where previous extension ends.

activePoint change for extension rule 3 (APCFER3): When rule 3 applies in any phase i, then before we move on to next phase i+1, we increment activeLength by 1. There is no change in activeNode and activeEdge. Why? Because in case of rule 3, the current character from string S is matched on the same path represented by current activePoint, so for next activePoint, activeNode and activeEdge remain the same, only activeLenth is increased by 1 (because of matched character in current phase). This new activePoint (same node, same edge and incremented length) will be used in phase i+1.

activePoint change for walk down (APCFWD): activePoint may change at the end of an extension based on extension rule applied. activePoint may also change during the extension when we do walk down. Lets consider an activePoint is (A, s, 11) in the above activePoint example figure. If this is the activePoint at the start of some extension, then while walk down from activeNode A, other internal nodes will be seen. Anytime if we encounter an internal node while walk down, that node will become activeNode (it will change activeEdge and activeLenght as appropriate so that new activePoint represents the same point as earlier). In this walk down, below is the sequence of changes in activePoint: (A, s, 11) >>> (B, w, 7) - >>> (C, a, 3) All above three activePoints refer to same point c Lets take another example. If activePoint is (D, a, 11) at the start of an extension, then while walk down, below is the sequence of changes in activePoint: (D, a, 10) >>> (E, d, 7) >>> (F, f, 5) >> (F, j, 1) All above activePoints refer to same point k. If activePoints are (A, s, 3), (A, t, 5), (B, w, 1), (D, a, 2) etc when no internal node comes in the way while walk down, then there will be no change in activePoint for APCFWD. The idea is that, at any time, the closest internal node from the point, where we want to reach, should be the activePoint. Why? This will minimize the length of traversal in the next extension.

activePoint change for Active Length ZERO (APCFALZ): Lets consider an activePoint (A, s, 0) in the above activePoint example figure. And lets say current character being processed from string S is x (or any other character). At the start of extension, when activeLength is ZERO, activeEdge is set to the current character being processed, i.e. x, because there is no walk down needed here (as activeLength is ZERO) and so next character we look for is current character being processed.

  • While code implementation, we will loop through all the characters of string S one by one. Each loop for ith character will do processing for phase i. Loop will run one or more time depending on how many extensions are left to be performed (Please note that in a phase i+1, we dont really have to perform all i+1 extensions explicitly, as trick 3 will take care of j extensions for all j leaf edges coming from previous phase i). We will use a variable remainingSuffixCount, to track how many extensions are yet to be performed explicitly in any phase (after trick 3 is performed). Also, at the end of any phase, if remainingSuffixCount is ZERO, this tells that all suffixes supposed to be added in tree, are added explicitly and present in tree. If remainingSuffixCount is non-zero at the end of any phase, that tells that suffixes of that many count are not added in tree explicitly (because of rule 3, we stopped early), but they are in tree implicitly though (Such trees are called implicit suffix tree). These implicit suffixes will be added explicitly in subsequent phases when a unique character comes in the way.

We will continue our discussion in Part 4 and Part 5. Code implementation will be discussed in Part 6.

Rectangle 27 0

Ukkonens Suffix Tree Construction Part 6

Please go through Part 1, Part 2, Part 3, Part 4 and Part 5, before looking at current article, where we have seen few basics on suffix tree, high level ukkonens algorithm, suffix link and three implementation tricks and activePoints along with an example string abcabxabcd where we went through all phases of building suffix tree. Here, we will see the data structure used to represent suffix tree and the code implementation.

At that end of Part 5 article, we have discussed some of the operations we will be doing while building suffix tree and later when we use suffix tree in different applications. There could be different possible data structures we may think of to fulfill the requirements where some data structure may be slow on some operations and some fast. Here we will use following in our implementation:

We will have SuffixTreeNode structure to represent each node in tree. SuffixTreeNode structure will have following members:

This data structure will answer to the required queries quickly as below:

Following is C implementation of Ukkonens Suffix Tree Construction. The code may look a bit lengthy, probably because of a good amount of comments.

// A C program to implement Ukkonen's Suffix Tree Construction
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define MAX_CHAR 256

struct SuffixTreeNode {
    struct SuffixTreeNode *children[MAX_CHAR];

    //pointer to other node via suffix link
    struct SuffixTreeNode *suffixLink;

    /*(start, end) interval specifies the edge, by which the
     node is connected to its parent node. Each edge will
     connect two nodes,  one parent and one child, and
     (start, end) interval of a given edge  will be stored
     in the child node. Lets say there are two nods A and B
     connected by an edge with indices (5, 8) then this
     indices (5, 8) will be stored in node B. */
    int start;
    int *end;

    /*for leaf nodes, it stores the index of suffix for
      the path  from root to leaf*/
    int suffixIndex;
};

typedef struct SuffixTreeNode Node;

char text[100]; //Input string
Node *root = NULL; //Pointer to root node

/*lastNewNode will point to newly created internal node,
  waiting for it's suffix link to be set, which might get
  a new suffix link (other than root) in next extension of
  same phase. lastNewNode will be set to NULL when last
  newly created internal node (if there is any) got it's
  suffix link reset to new internal node created in next
  extension of same phase. */
Node *lastNewNode = NULL;
Node *activeNode = NULL;

/*activeEdge is represeted as input string character
  index (not the character itself)*/
int activeEdge = -1;
int activeLength = 0;

// remainingSuffixCount tells how many suffixes yet to
// be added in tree
int remainingSuffixCount = 0;
int leafEnd = -1;
int *rootEnd = NULL;
int *splitEnd = NULL;
int size = -1; //Length of input string

Node *newNode(int start, int *end)
{
    Node *node =(Node*) malloc(sizeof(Node));
    int i;
    for (i = 0; i < MAX_CHAR; i++)
          node->children[i] = NULL;

    /*For root node, suffixLink will be set to NULL
    For internal nodes, suffixLink will be set to root
    by default in  current extension and may change in
    next extension*/
    node->suffixLink = root;
    node->start = start;
    node->end = end;

    /*suffixIndex will be set to -1 by default and
      actual suffix index will be set later for leaves
      at the end of all phases*/
    node->suffixIndex = -1;
    return node;
}

int edgeLength(Node *n) {
    return *(n->end) - (n->start) + 1;
}

int walkDown(Node *currNode)
{
    /*activePoint change for walk down (APCFWD) using
     Skip/Count Trick  (Trick 1). If activeLength is greater
     than current edge length, set next  internal node as
     activeNode and adjust activeEdge and activeLength
     accordingly to represent same activePoint*/
    if (activeLength >= edgeLength(currNode))
    {
        activeEdge += edgeLength(currNode);
        activeLength -= edgeLength(currNode);
        activeNode = currNode;
        return 1;
    }
    return 0;
}

void extendSuffixTree(int pos)
{
    /*Extension Rule 1, this takes care of extending all
    leaves created so far in tree*/
    leafEnd = pos;

    /*Increment remainingSuffixCount indicating that a
    new suffix added to the list of suffixes yet to be
    added in tree*/
    remainingSuffixCount++;

    /*set lastNewNode to NULL while starting a new phase,
     indicating there is no internal node waiting for
     it's suffix link reset in current phase*/
    lastNewNode = NULL;

    //Add all suffixes (yet to be added) one by one in tree
    while(remainingSuffixCount > 0) {

        if (activeLength == 0)
            activeEdge = pos; //APCFALZ

        // There is no outgoing edge starting with
        // activeEdge from activeNode
        if (activeNode->children[text[activeEdge]] == NULL)
        {
            //Extension Rule 2 (A new leaf edge gets created)
            activeNode->children[text[activeEdge]] =
                                          newNode(pos, &leafEnd);

            /*A new leaf edge is created in above line starting
             from  an existng node (the current activeNode), and
             if there is any internal node waiting for it's suffix
             link get reset, point the suffix link from that last
             internal node to current activeNode. Then set lastNewNode
             to NULL indicating no more node waiting for suffix link
             reset.*/
            if (lastNewNode != NULL)
            {
                lastNewNode->suffixLink = activeNode;
                lastNewNode = NULL;
            }
        }
        // There is an outgoing edge starting with activeEdge
        // from activeNode
        else
        {
            // Get the next node at the end of edge starting
            // with activeEdge
            Node *next = activeNode->children[text[activeEdge]];
            if (walkDown(next))//Do walkdown
            {
                //Start from next node (the new activeNode)
                continue;
            }
            /*Extension Rule 3 (current character being processed
              is already on the edge)*/
            if (text[next->start + activeLength] == text[pos])
            {
                //If a newly created node waiting for it's 
				//suffix link to be set, then set suffix link 
				//of that waiting node to curent active node
				if(lastNewNode != NULL && activeNode != root)
				{
					lastNewNode->suffixLink = activeNode;
					lastNewNode = NULL;
				}

                //APCFER3
                activeLength++;
                /*STOP all further processing in this phase
                and move on to next phase*/
                break;
            }

            /*We will be here when activePoint is in middle of
              the edge being traversed and current character
              being processed is not  on the edge (we fall off
              the tree). In this case, we add a new internal node
              and a new leaf edge going out of that new node. This
              is Extension Rule 2, where a new leaf edge and a new
            internal node get created*/
            splitEnd = (int*) malloc(sizeof(int));
            *splitEnd = next->start + activeLength - 1;

            //New internal node
            Node *split = newNode(next->start, splitEnd);
            activeNode->children[text[activeEdge]] = split;

            //New leaf coming out of new internal node
            split->children[text[pos]] = newNode(pos, &leafEnd);
            next->start += activeLength;
            split->children[text[next->start]] = next;

            /*We got a new internal node here. If there is any
              internal node created in last extensions of same
              phase which is still waiting for it's suffix link
              reset, do it now.*/
            if (lastNewNode != NULL)
            {
            /*suffixLink of lastNewNode points to current newly
              created internal node*/
                lastNewNode->suffixLink = split;
            }

            /*Make the current newly created internal node waiting
              for it's suffix link reset (which is pointing to root
              at present). If we come across any other internal node
              (existing or newly created) in next extension of same
              phase, when a new leaf edge gets added (i.e. when
              Extension Rule 2 applies is any of the next extension
              of same phase) at that point, suffixLink of this node
              will point to that internal node.*/
            lastNewNode = split;
        }

        /* One suffix got added in tree, decrement the count of
          suffixes yet to be added.*/
        remainingSuffixCount--;
        if (activeNode == root && activeLength > 0) //APCFER2C1
        {
            activeLength--;
            activeEdge = pos - remainingSuffixCount + 1;
        }
        else if (activeNode != root) //APCFER2C2
        {
            activeNode = activeNode->suffixLink;
        }
    }
}

void print(int i, int j)
{
    int k;
    for (k=i; k<=j; k++)
        printf("%c", text[k]);
}

//Print the suffix tree as well along with setting suffix index
//So tree will be printed in DFS manner
//Each edge along with it's suffix index will be printed
void setSuffixIndexByDFS(Node *n, int labelHeight)
{
    if (n == NULL)  return;

    if (n->start != -1) //A non-root node
    {
        //Print the label on edge from parent to current node
        print(n->start, *(n->end));
    }
    int leaf = 1;
    int i;
    for (i = 0; i < MAX_CHAR; i++)
    {
        if (n->children[i] != NULL)
        {
            if (leaf == 1 && n->start != -1)
                printf(" [%d]\n", n->suffixIndex);

            //Current node is not a leaf as it has outgoing
            //edges from it.
            leaf = 0;
            setSuffixIndexByDFS(n->children[i], labelHeight +
                                  edgeLength(n->children[i]));
        }
    }
    if (leaf == 1)
    {
        n->suffixIndex = size - labelHeight;
        printf(" [%d]\n", n->suffixIndex);
    }
}

void freeSuffixTreeByPostOrder(Node *n)
{
    if (n == NULL)
        return;
    int i;
    for (i = 0; i < MAX_CHAR; i++)
    {
        if (n->children[i] != NULL)
        {
            freeSuffixTreeByPostOrder(n->children[i]);
        }
    }
    if (n->suffixIndex == -1)
        free(n->end);
    free(n);
}

/*Build the suffix tree and print the edge labels along with
suffixIndex. suffixIndex for leaf edges will be >= 0 and
for non-leaf edges will be -1*/
void buildSuffixTree()
{
    size = strlen(text);
    int i;
    rootEnd = (int*) malloc(sizeof(int));
    *rootEnd = - 1;

    /*Root is a special node with start and end indices as -1,
    as it has no parent from where an edge comes to root*/
    root = newNode(-1, rootEnd);

    activeNode = root; //First activeNode will be root
    for (i=0; i<size; i++)
        extendSuffixTree(i);
    int labelHeight = 0;
    setSuffixIndexByDFS(root, labelHeight);

    //Free the dynamically allocated memory
    freeSuffixTreeByPostOrder(root);
}

// driver program to test above functions
int main(int argc, char *argv[])
{
//  strcpy(text, "abc"); buildSuffixTree();
//  strcpy(text, "xabxac#");    buildSuffixTree();
//  strcpy(text, "xabxa");  buildSuffixTree();
//  strcpy(text, "xabxa$"); buildSuffixTree();
    strcpy(text, "abcabxabcd$"); buildSuffixTree();
//  strcpy(text, "geeksforgeeks$"); buildSuffixTree();
//  strcpy(text, "THIS IS A TEST TEXT$"); buildSuffixTree();
//  strcpy(text, "AABAACAADAABAAABAA$"); buildSuffixTree();
    return 0;
}

Output (Each edge of Tree, along with suffix index of child node on edge, is printed in DFS order. To understand the output better, match it with the last figure no 43 in previous Part 5 article):

$ [10]
ab [-1]
c [-1]
abxabcd$ [0]
d$ [6]
xabcd$ [3]
b [-1]
c [-1]
abxabcd$ [1]
d$ [7]
xabcd$ [4]
c [-1]
abxabcd$ [2]
d$ [8]
d$ [9]
xabcd$ [5]

Now we are able to build suffix tree in linear time, we can solve many string problem in efficient way:

The above basic problems can be solved by DFS traversal on suffix tree. We will soon post articles on above problems and others like below:

Rectangle 27 0

Try to build a Generalized Suffix Tree if the dictionary will be matched by sequence of queries. There is linear time algorithm that can be used to build such tree (Ukkonen Suffix Tree Construction).

You can easily match (it's O(k), where k is the size of the query) each query by traversing from the root node, and use the wildcard character to match any character like typical pattern finding in suffix tree.

performance - Efficient data structure for word lookup with wildcards ...

performance algorithm design data-structures memory-management
Rectangle 27 0

It's not a direct answer, however it can help you.

Last year, while working on the subject, I ended using suffix-arrays instead of suffix-trees, and IIRC, I used the paper "An incomplex algorithm for fast suffix array construction " KB Schrmann (2007) [1] as a reference. IIRC, it's a two pass linear algorithm to build suffix-arrays.

The algorithm has empirically been shown to perform extremely well, but as far as I know (and stated by the authors) it has not been proven to be of linear complexity.

algorithm - Conceptually simple linear-time suffix tree constructions ...

algorithm data-structures suffix-tree
Rectangle 27 0

Ukkonens Suffix Tree Construction Part 1

Suffix Tree is very useful in numerous string processing and computational biology problems. Many books and e-resources talk about it theoretically and in few places, code implementation is discussed. But still, I felt something is missing and its not easy to implement code to construct suffix tree and its usage in many applications. This is an attempt to bridge the gap between theory and complete working code implementation. Here we will discuss Ukkonens Suffix Tree Construction Algorithm. We will discuss it in step by step detailed way and in multiple parts from theory to implementation. We will start with brute force way and try to understand different concepts, tricks involved in Ukkonens algorithm and in the last part, code implementation will be discussed.Note: You may find some portion of the algorithm difficult to understand while 1st or 2nd reading and its perfectly fine. With few more attempts and thought, you should be able to understand such portions.

Rectangle 27 0

*********************Phase 2*********************************

At the end of phase 2, remainingSuffixCount is ZERO (All suffixes are added explicitly). Figure 22 in Part 3 is the resulting tree after phase 2.

Rectangle 27 0

*********************Phase 4*********************************

At the end of phase 4, remainingSuffixCount is 1 (One suffix a, the last one, is not added explicitly in tree, but it is there in tree implicitly). Figure 28 in Part 3 is the resulting tree after phase 4. Revisiting completed for 1st four phases, we will continue building the tree and see how it goes.

Rectangle 27 0

*********************Phase 9*********************************

*********************Phase 8*********************************

*********************Phase 7*********************************

*********************Phase 10*********************************

*********************Phase 11*********************************

At the end of phase 9, remainingSuffixCount is 3 (Three suffixes, abc, bc and c, the last three, are not added explicitly in tree explicitly, but they are in tree implicitly).

At the end of phase 8, remainingSuffixCount is 2 (Two suffixes, ab and b, the last two, are not added explicitly in tree explicitly, but they are in tree implicitly).

At the end of phase 7, remainingSuffixCount is 1 (One suffix a, the last one, is not added explicitly in tree, but it is there in tree implicitly). Above Figure 33 is the resulting tree after phase 7.

Now we have added all suffixes of string abcabxabcd$ in suffix tree. There are 11 leaf ends in this tree and labels on the path from root to leaf end represents one suffix. Now the only one thing left is to assign a number (suffix index) to each leaf end and that number would be the suffix starting position in the string S. This can be done by a DFS traversal on tree. While DFS traversal, keep track of label length and when a leaf end is found, set the suffix index as stringSize labelSize + 1. Indexed suffix tree will look like below:

Ukkonen's Suffix Tree Construction

In above Figure, suffix indices are shown as character position starting with 1 (Its not zero indexed). In code implementation, suffix index will be set as zero indexed, i.e. where we see suffix index j (1 to m for string of length m) in above figure, in code implementation, it will be j-1 (0 to m-1) And we are done !!!! Data Structure to represent suffix tree How to represent the suffix tree?? There are nodes, edges, labels and suffix links and indices. Below are some of the operations/query we will be doing while building suffix tree and later on while using the suffix tree in different applications/usages:

We see following facts in Phase 10:

We may think of different data structures which can fulfil these requirements. In the next Part 6, we will discuss the data structure we will use in our code implementation and the code as well.

Rectangle 27 0

Dan Gusfield

A suffix tree T for a m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. (Given that last string character is unique in string)

Concatenation of the edge-labels on the path from the root to leaf i gives the suffix of S that starts at position i, i.e. S[im].

Note: Position starts with 1 (its not zero indexed, but later, while code implementation, we will used zero indexed position)

For string S = xabxac with m = 6, suffix tree will look like following:

It has one root node and two internal nodes and 6 leaf nodes.

String Depth of red path is 1 and it represents suffix c starting at position 6 String Depth of blue path is 4 and it represents suffix bxca starting at position 3 String Depth of green path is 2 and it represents suffix ac starting at position 5 String Depth of orange path is 6 and it represents suffix xabxac starting at position 1

Edges with labels a (green) and xa (orange) are non-leaf edge (which ends at an internal node). All other edges are leaf edge (ends at a leaf)

If one suffix of S matches a prefix of another suffix of S (when last character in not unique in string), then path for the first suffix would not end at a leaf.

For String S = xabxa, with m = 5, following is the suffix tree:

Here we will have 5 suffixes: xabxa, abxa, bxa, xa and a. Path for suffixes xa and a do not end at a leaf. A tree like above (Figure 2) is called implicit suffix tree as some suffixes (xa and a) are not seen explicitly in tree.

To avoid this problem, we add a character which is not present in string already. We normally use $, # etc as termination characters. Following is the suffix tree for string S = xabxa$ with m = 6 and now all 6 suffixes end at leaf.

A naive algorithm to build a suffix tree Given a string S of length m, enter a single edge for suffix S[l ..m]$ (the entire string) into the tree, then successively enter suffix S[i..m]$ into the growing tree, for i increasing from 2 to m. Let Ni denote the intermediate tree that encodes all the suffixes from 1 to i. So Ni+1 is constructed from Ni as follows:

This takes O(m2) to build the suffix tree for the string S of length m. Following are few steps to build suffix tree based for string xabxa$ based on above algorithm:

Implicit suffix tree While generating suffix tree using Ukkonens algorithm, we will see implicit suffix tree in intermediate steps few times depending on characters in string S. In implicit suffix trees, there will be no edge with $ (or # or any other termination character) label and no internal node with only one edge going out of it. To get implicit suffix tree from a suffix tree S$,

High Level Description of Ukkonens algorithm Ukkonens algorithm constructs an implicit suffix tree Ti for each prefix S[l ..i] of S (of length m). It first builds T1 using 1st character, then T2 using 2nd character, then T3 using 3rd character, , Tm using mth character. Implicit suffix tree Ti+1 is built on top of implicit suffix tree Ti. The true suffix tree for S is built from Tm by adding $. At any time, Ukkonens algorithm builds the suffix tree for the characters seen so far and so it has on-line property that may be useful in some situations. Time taken is O(m).

Ukkonens algorithm is divided into m phases (one phase for each character in the string with length m) In phase i+1, tree Ti+1 is built from tree Ti.

Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1..i+1] In extension j of phase i+1, the algorithm first finds the end of the path from the root labelled with substring S[j..i]. It then extends the substring by adding the character S(i+1) to its end (if it is not there already). In extension 1 of phase i+1, we put string S[1..i+1] in the tree. Here S[1..i] will already be present in tree due to previous phase i. We just need to add S[i+1]th character in tree (if not there already). In extension 2 of phase i+1, we put string S[2..i+1] in the tree. Here S[2..i] will already be present in tree due to previous phase i. We just need to add S[i+1]th character in tree (if not there already) In extension 3 of phase i+1, we put string S[3..i+1] in the tree. Here S[3..i] will already be present in tree due to previous phase i. We just need to add S[i+1]th character in tree (if not there already) . . In extension i+1 of phase i+1, we put string S[i+1..i+1] in the tree. This is just one character which may not be in tree (if character is seen first time so far). If so, we just add a new leaf edge with label S[i+1].

High Level Ukkonens algorithm Construct tree T1 For i from 1 to m-1 do begin {phase i+1} For j from 1 to i+1 begin {extension j} Find the end of the path from the root labelled S[j..i] in the current tree. Extend that path by adding character S[i+l] if it is not there already end; end;

Suffix extension is all about adding the next character into the suffix tree built so far. In extension j of phase i+1, algorithm finds the end of S[j..i] (which is already in the tree due to previous phase i) and then it extends S[j..i] to be sure the suffix S[j..i+1] is in the tree.

There are 3 extension rules:Rule 1: If the path from the root labelled S[j..i] ends at leaf edge (i.e. S[i] is last character on leaf edge) then character S[i+1] is just added to the end of the label on that leaf edge.

Rule 2: If the path from the root labelled S[j..i] ends at non-leaf edge (i.e. there are more characters after S[i] on path) and next character is not s[i+1], then a new leaf edge with label s{i+1] and number j is created starting from character S[i+1]. A new internal node will also be created if s[1..i] ends inside (in-between) a non-leaf edge.

Rule 3: If the path from the root labelled S[j..i] ends at non-leaf edge (i.e. there are more characters after S[i] on path) and next character is s[i+1] (already in tree), do nothing.

One important point to note here is that from a given node (root or internal), there will be one and only one edge starting from one character. There will not be more than one edges going out of any node, starting with same character.

Following is a step by step suffix tree construction of string xabxac using Ukkonens algorithm:

Ukkonen's Suffix Tree Construction

In next parts (Part 2, Part 3, Part 4 and Part 5), we will discuss suffix links, active points, few tricks and finally code implementations (Part 6).

Rectangle 27 0

Ukkonens Suffix Tree Construction Part 4

Please go through Part 1, Part 2 and Part 3, before looking at current article, where we have seen few basics on suffix tree, high level ukkonens algorithm, suffix link and three implementation tricks and some details on activePoint along with an example string abcabxabcd where we went through four phases of building suffix tree.

Lets revisit those four phases we have seen already in Part 3, in terms of trick 2, trick 3 and activePoint.