There are many times when programmers need to search for a substring, for example when parsing text. This is commonly referred to as searching for a needle (substring) in a haystack (the string to search in). The most straightforward way to do this is by using search functions that your language provides:
- C: strchr()/memchr(), strstr()/memmem()
- C++: string::find()
- Ruby: String#index or regular expressions
- Python: string.find() or regular expressions
However those functions are usually implemented in a naive way. They usually go through every index in the haystack and try to compare the substring at that index with the given needle. Thus, given a needle of length M and a haystack of length N, those algorithms need O(M*N) steps in the worst case. This is acceptable for small haystacks, but becomes more and more problematic as the the haystack grows.
In this article we’ll examine smarter algorithms, in particular Boyer-Moore and its variants. Finally we’ll present the reader with an optimized Boyer-Moore-Horspool implementation in C++. This implementation differs from most others in that it’s heavily optimized and is capable of accepting haystack input in a streaming manner.
Before we move on, it should be noted that Python’s string.find(), Ruby regular expressions and glibc’s implementation of memmem() actually can use smarter algorithms when conditions are right, but that is besides the main point of this article.
Demonstration of a naive substring search algorithm
Are there faster algorithms for searching needles? Why yes, computer science has already found the answer decades ago. Here are two algorithms that can find needles in less than O(M*N) time:
The ideas behind these algorithms are that, when a mismatch occurs during an iteration, you can sometimes intelligently skip several characters or you can start the next substring match at a higher position in order to avoid re-matching characters unnecessarily. Both algorithms are characterized by the need to precompute some preparation data. It is this data that allows them to search for needles in a way faster than O(M*N).
Knuth-Morris-Pratt simulates how a finite state automata accepts input symbols. During its precomputation phase it builds this automata, which requires O(M) time. Its main search algorithm, which runs this automata, has a worst case time complexity of O(N).
Boyer-Moore is similar, but in its precomputation phase it needs to build two tables, a so-called good suffix table and a bad suffix table. These tables are built from the needle contents. Its main search algorithm requires 3*N steps in the worst case, thus having a worst case time complexity of O(N).
More on Boyer-Moore
Boyer-Moore is a very peculiar algorithm because a longer needle makes it faster! This counter-intuitive property is caused by the fact Boyer-Moore matches the last needle character first and matches from right to left. If the last needle character doesn’t match the haystack then it knows it can skip M characters. When one of the other needle characters mismatch as well, it can usually skip more than 1 character based on the character which mismatched; this information is retrieved from the tables that had to be built in the precomputation phase. Thus, Boyer-Moore usually finishes in sublinear time.
Boyer-Moore has several variants.
- Turbo Boyer-Moore takes more space but needs 2*N steps in the worst case instead of the original’s 3*N steps.
- Boyer-Moore-Horspool uses less space (only requires the good suffix table) and has a simpler inner loop with less constant overhead per iteration. Its average performance is O(N) but it has a worst-case performance of O(M*N).
- Boyer-Moore-Horspool itself also has variants. In Timo Raita’s 1992 paper “Tuning the Boyer-Moore-Horspool String Searching Algorithm” he asserts that text is rarely random and that there usually exist strong dependencies between characters. Thus there are smarter ways to match the needle: instead of matching it backwards from the last character, one should first match the last character, then the first character, then the middle one, etc. The advantage of this heuristic becomes obvious if we consider an example in which we search for the needle “hello world” in a haystack that often contains the word “world” in combination with a word other than “hello”, e.g. “beautiful world”, “my world”, “another world”. When the algorithm matches the last needle character “d”, instead of wasting time matching “worl” as well, it can match the first needle character “h” and immediately determine that there is no match. This heuristic can drastically improve Boyer-Moore-Horspool’s average performance.
Theory vs practice
Which algorithm should one use, Boyer-Moore, Turbo Boyer-Moore or Boyer-Moore-Horspool? In theory Turbo Boyer-Moore is an attractive alternative because of its worst-case performance, but real-world experience has shown that theory is rarely the same as practice. Clearly, this calls for benchmarks.
Various implementations of Boyer-Moore (and variants) exist in a variety of languages, but for the purpose of benchmarking we will want to use C or C++ because they allow the most optimal implementations. (Or assembly, but let’s not be too l33t today.) However there seem to be few good C/C++ implementations out there that are:
After much digging I’ve found one C++ implementation that most matches our requirements. Its author, Joel Yliluoma, has agreed to open source his code under the MIT license, which I’m very grateful for. He had implemented Boyer-Moore, Turbo Boyer-Moore and Boyer-Moore-Horspool with some of the ideas by Timo Raita 1992. We’ve extended his work by adding unit tests, writing documentation and adding further optimizations. Using this extended code we’ve set up a benchmark suite. The results are as follows.
The benchmark is conducted on a MacBook Pro with OS X Snow Leopard and Intel Core 2 Duo 2.4 Ghz. For comparison reasons we’ve included benchmark results for strstr() and memmem() as well. strstr() is omitted in benchmark 1 because it cannot handle arbitrary random data. We’ve implemented memmem() ourselves because memmem() is a GNU extension and doesn’t exist on OS X. The test program is compiled with -O2 optimizations.
Benchmark 1: random data
This benchmark searches for the needle “I have control\n” in a 200 MB file containing random binary data, 10 iterations.
|Boyer-Moore-Horspool (Joel’s original implementation)||787 ms|
|Boyer-Moore-Horspool (our implementation)||729 ms|
|Turbo Boyer-Moore||1745 ms|
As expected, memmem() is the slowest of the bunch, being more than twice as slow as Boyer-Moore, but actually doesn’t perform *that* badly.
Turbo Boyer-Moore, despite its name, performs poorly compared to the original.
Boyer-Moore-Horspool is significantly faster than everything else. Our optimized implementation is slightly faster than Joel’s original.
Benchmark 2: Alice in Wonderland
This benchmark searches for the needle “I have control\n” in a special compilation of Alice in Wonderland. We took the HTML version from Project Gutenberg and multiplied its contents many times in order to generate a 200 MB Alice in Wonderland HTML file. This benchmark ran for 10 iterations. This is probably a much better real-world test than the random binary blob in benchmark 1.
|Boyer-Moore-Horspool (Joel’s original implementation)||1101 ms|
|Boyer-Moore-Horspool (our implementation)||1080 ms|
|Turbo Boyer-Moore||2683 ms|
The results are very similar to those of benchmark 1. Again, strstr() and memmem() are the slowest of the bunch, and our implementation of Boyer-Moore-Horspool the fastest.
Benchmark 3: triggering bad performance
In this benchmark we attempt to trigger bad peformance in Boyer-Moore-Horspool. We created a 200 MB file containing nothing but newlines, in which we look for the needle “I have control\n\n”. Notice the extra newline at the end of the needle! This causes the algorithm to skip slowly. The benchmark ran for 10 iterations.
|Boyer-Moore-Horspool (Joel’s original implementation)||12425 ms|
|Boyer-Moore-Horspool (our implementation)||9098 ms|
|Turbo Boyer-Moore||2281 ms|
Here, Boyer-Moore-Horspool is significantly worse than even memmem() despite the same theoretical worst-case complexity! This is because memmem() has a much simpler inner loop with lower constant overhead. We also see that our optimizations to Boyer-Moore-Horspool really help.
Different substring searching algorithms have significantly different performance characteristics. For small haystacks it doesn’t really matter which one you use, but for larger ones you should definitely consider smarter algorithms like Boyer-Moore. Using these smarter algorithms is not without cost however: because they require preparation (creating an FSA or building tables) you must ask yourself whether the cost of preparation is acceptable for your use cases.
Turbo Boyer-Moore is disappointing, its name doesn’t do it justice. In academia constant overhead doesn’t matter, but here we see that it matters a lot in practice. Turbo Boyer-Moore’s inner loop is so complex that we think we’re better off using the original Boyer-Moore.
We think that for most cases Boyer-Moore-Horspool is the best choice. Despite its theoretical worst-case performance, its worst case is seldomly triggered; we had to spend effort on generating a bad input file and picking a bad needle on purpose. Its inner loop is simple and easy to understand. Its simplicity also results in better CPU cache utilization and branching behavior. On average Boyer-Moore-Horspool performs so well that for us, the choice is clear. Boyer-Moore-Horspool is an excellent choice for things like parsing protocol headers and multipart MIME data. In case of multipart MIME data, most browsers generate fairly long boundary strings, allowing Boyer-Moore-Horspool to skip over non-boundary data very quickly.
Most string search algorithm implementations require the entire haystack data to be in memory. In contrast, we’ve written a “streaming” Boyer-Moore-Horspool implementation which allows one to feed the haystack data piece-of-piece. This is especially useful when parsing large amounts of protocol data received over the network without keeping all data in memory.
Our implementation is heavily optimized for both memory usage and CPU utilization as well as reusability. During the development of this implementation we’ve applied the highest degree of C++ kung-fu in order to get every last bit of efficiency out of the computer.
- It’s reentrant. No global variables. Instead the algorithm accepts a needle-specific context structure.
- All data structures are optimized for size and locality of reference. Smaller size means less memory usage, which in turn means less CPU cache usage. On modern systems CPU cache utilization is one of the most important performance factors.
- The context structure can be allocated in a single memory allocation, irregardless of the length of the needle. This minimizes memory allocation overhead, both in time (e.g. by allocating on the stack) and in space (by avoiding malloc() bookkeeping overhead). The algorithm does not need any more memory allocations than this initial one. The code also does not dictate a specific way to allocate context structure; the user is free to choose where to allocate it.
- We’ve also applied some of the ideas by Timo Raita 1992, but our implementation differs from Joel’s original implementation through minimizing function call overhead. Joel’s implementation calls memcmp(); ours first compares the first character using inline code.
- Our code contains various CPU branch prediction hints in order to minimize branching overhead.
- We make use of the ‘restrict’ keyword in order to allow better compiler optimizations.