Suffix Trees in Java

I confess that I consider myself more of a software application developer than a computer scientist. I prefer to understand algorithms at a conceptual, intuitive level—how they do what they do, and when they should be used in an application. When I read an algorithm book, my eyes glaze over when presented with page after page of mathematical symbols and proofs. But recently, when a potential client presented me with a string searching problem, I decided to do a little background research and came across a little-known programming data structure that has changed my outlook on playing tricks with algorithms.

I was asked to find the longest sequentially repeating substring in a string of unknown length. That is, in the string "threetwooneonetwothree", "three" is repeated, but not sequentially; "one" is the longest string repeated sequentially (longer than, for example, "e"). Rather than chart off on my own (that's not what I would do in real life, after all), I did a Google search and immediately fell upon a similar problem, the longest repeated substring problem. The stated solution was to build something called a suffix tree.

This led me to pull out my long-owned but little-read copy of Algorithms on Strings, Trees, and Sequences by Dan Gusfield. With a little help from an invaluable article by Mark Nelson, Fast String Searching With Suffix Trees, it became clear that suffix trees are very interesting—and useful—animals. Part of my approach in the discussion below comes from L. Allison's Suffix Trees.

Suffix Trees in a Nutshell

Let's say we have the string "abba". We can make a list of all possible suffixes:

a
ba
bba
abba

You'll note that the two strings in the middle start with the same letter. We could save space and represent this commonality by creating a tree, in which "b" is written on the parent branch and the child branches are marked "a" and "ba":

.
+-a*
+-b
+-a*
+-ba*
+-abba*

I've marked the root node as '.' and put an '*' on the leaf nodes. One useful product of the suffix tree is that the full path from the root to any leaf spells out a suffix of the original string. This means that there will be exactly as many leaves as there are substrings in the string—which is also equal to the number of characters in the string. Note also that every node has at least two child branches, and that no two child branches of a node start with the same character.

Suffix trees take up an inordinate amount of memory compared to the original string, but their properties allow several searching and retrieval operations to take place extremely faster than searching the original string. For example, searching for "ba" in the string above would, in a naive approach, involve placing string "ba" beside the original string at every position to see if there is a match. Because the substring "ba" appears at the end, this could take quite a while for long strings.

With a suffix tree, however, we simply need to start from the root and compare each character, one at a time. Every substring in the original string will appear in the tree starting from the root; if it extends to the leaf (as "ba" does here), it is a suffix of the original string as well. Here we only care that we can start at "b" from the root at follow a path down to "ba", and we can do this comparison very quickly, because at each node we can either continue to the next character or not—we don't have to search through all the combinations.

Creating Suffix Trees

If we can stand the high memory costs of a suffix tree, we gain great search speedups once a suffix tree is constructed—but only if constructing the suffix tree itself doesn't take very long! Early attempts at constructing suffix trees took O(m2) time. Even such a slow construction time is useful if we want to analyze a long sequence and then later compare smaller strings against the stored suffix tree, but eventually Esko Ukkonen devised an algorithm that allows construction of suffix trees in linear time. Ukkonen's is furthermore an online algorithm, meaning we can start from the beginning and work our way along the string when constructing a suffix tree, rather than starting at the end and working backwords as did earlier approaches.

Conceptually Ukkonen's method first builds up an implicit suffix tree in which some suffixes may not end on a leaf if they are already contained in an existing path. The algorithm starts with an empty tree and iteravely adds each character to each branch as well as creates a new branch for the character:

  1. .
    +-a*
  2. .
    +-b*
    +-ab*
  3. .
    +-bb*
    +-abb*
  4. .
    +-b
    +-a*
    +-ba*
    +-abba*

A couple of subtleties should be noted. On step 2, we appended "b" to the existing branch and then added a new branch for "b". But on step 3, when we came to a second "b", we didn't need to add a new branch for "b" because the branch "bb" started with "b"—the "b" was implicit inside the branch "bb" (ending in the middle of the string, not on a leaf node). However, on step 4, we had to split the "bb" branch, because one of the "b"s is followed by "a" and the other by "ba".

That in essence is Ukkonen's algorithm, but that still costs substantially more than linear time. The brilliance of Ukkonen's approach is that it recognizes several shortcuts which can be made:

There is one twist: if the string ends in a suffix that is repeated, the resulting suffix tree will be implicit: some of the suffixes will not end on a leaf but instead end on in the middle of some branch. In step 4 above, for example, the suffix "a" is not given its own branch but is instead implicit in the "abba" branch.

To solve this problem, we merely need to change the ending implicit suffix tree into an explicit suffix tree. The traditional theoretical way is to run through the algorithm again with a character not found anywhere in the string. Most of the textbooks simply append the character "$" to the input stream and leave to the reader such practicalities as "Is '$' really so uncommon that I don't expect to see it in my input string?", and "How do I get rid of the '$' once I've constructed the suffix tree?" More on this later.

Suffix Trees in Java

Armed with an understanding of suffix trees, I was ready to tackle the interview problem, but there remained one issue: where would I find a suffix tree implementation in Java? Plenty of searching on the web yielded a possible candidate in a biology library, as well as a port (and its refactored update) by Illya Havsiyevych of Mark Nelson's C++ suffix tree. Neither of these seem to be exactly what I wanted; I didn't want an entire biology library just to get a suffix tree, and even if Illya's version were up to my standards (I don't know; I never downloaded the source code), I still needed more than a suffix tree—I needed a suffix tree with methods to solve practical problems. So as I commonly do, I decided to write my own.

I started with the C++ structure presented by Mark Nelson, but after everything was working I refactored and expanded the classes to allow practical applications. The main class is CharSequenceSuffixTree, which represents a Java CharSequence in a suffix tree. There are many tests in CharSequenceSuffixTreeTest which ensures that suffix trees are created correctly and can be searched. Because these classes depends on other code I've written over the years, it would probably be more useful to download and build the entire globalmentor-core library using Git and Maven:

git clone https://bitbucket.org/globalmentor/globalmentor-core.git
cd globalmentor-core
mvn install

The library is also available for inclusion as a Maven dependency or direct download from Maven Central Repository using coordinates com.globalmentor:globalmentor-core:x.x.x.

My Java suffix tree implementation differs from that in Fast String Searching With Suffix Trees in several aspects:

In addition there are several API enhancements. The algorithm works in terms of Node rather than merely node indexes. Nodes now know their own index, as well as their parent node, if any. Nodes even know if they are leaf nodes. The latter is done elegantly by using a single Java BitSet instance, using only one bit per node to indicate whether the node is a branch or a leaf. The classes allow normal Java iteration, for example of the child edges of a node.

Longest Sequentially Repeated Substring

After days of wrapping my head around suffix trees, and more days implementing one in Java, it remained to solve the original problem I had been given: finding the longest substring repeated sequentially. Solving the traditional longest repeated substring problem using a suffix tree is straightforward. After understanding how a suffix tree is created, it becomes clear that every time a branch is broken, the string from the left of the break back to the root represents some substring that was repeated (which is why we are breaking the branch in the first place, because there is more than one possibility of the character on the right side of the break). Thus, the longest repeated substring can be found by finding the non-leaf node that lies farthest away from the root (in terms of characters). This node can be found by simply walking the tree a single time. Once this node is found, the longest repeated substring is composed of the characters from the root to that node.

To find the longest repeated substring that is repeated sequentially, we simply need to look at each repeated substring (i.e. each non-leaf node), but before comparing its size to make sure it is the longest, we must first ensure that the string is repeated sequentially. To do this, we need to see if the repeated string (from the root to the current node) is contained in the child branches beginning at and emanating from the current node. This is simply the standard string comparison algorithm mentioned above, for which suffix trees are traditionally used, taking advantage of the fact that no two child branches of any one node will start with the same character.

Using my Java implementation of suffix trees and its convenience APIs for traversal and comparison, the final algorithm appears in CharSequenceSuffixTrees:

public static CharSequence getLongestSequentialRepeatedSubsequence(final CharSequence charSequence)
{
  final CharSequenceSuffixTree suffixTree = CharSequenceSuffixTree.create(charSequence);
  final ObjectHolder<String> result = new ObjectHolder<String>();
  visit(suffixTree, new AbstractCharSequenceVisitor()
  {
    int maxLength = 0;

    @Override
    public boolean visit(final SuffixTree suffixTree, final CharSequenceNode node,
        final CharSequenceEdge parentEdge, final CharSequence charSequence)
    {
      if(!node.isLeaf())
      {
        if(charSequence.length() > maxLength)
        {
          if(parentEdge.getChildNode().startsWith(charSequence))
          {
            maxLength = charSequence.length();
            result.setObject(charSequence.toString());
          }
        }
      }
      return true;
    }
  });
  return result.getObject();
}

As times goes on I will likely fix bugs and improve the API, so you are encouraged to download the full source code from the globalmentor-core repository, as outlined above. The source code there contains full comments as well.

This post was updated on 2016-06-18 to reflect the new location of the globalmentor-core repository.