Suffix ArrayEdit

Suffix arrays are a foundational data structure in string processing, providing a compact and searchable representation of all suffixes of a given string. Given a string S of length n, a suffix array SA is a permutation of the integers {0, 1, ..., n-1} such that the suffixes S[SA[i]..] appear in lexicographic order as i goes from 0 to n-1. By indexing into S with SA, many queries—such as locating occurrences of a pattern or performing fast substring searches—can be performed efficiently. In practice, a single end marker (often denoted by a character smaller than any character in the alphabet) is appended to the string to guarantee a well-defined ordering and to simplify comparisons.

Suffix arrays interact closely with other text-processing ideas, such as the LCP array (which stores the longest common prefix lengths between consecutive suffixes in SA) and the suffix tree (a more general, pointer-based representation of all suffixes). They are central to a range of applications from full-text search to data compression and genome analysis. Their compactness and the ability to support fast searches make them a standard tool in both theoretical and applied settings. For background and related concepts, see also text indexing and suffix tree.

Construction

Naive approach

A straightforward way to build a suffix array is to generate all suffixes of S, store their starting positions, and sort them lexicographically using a comparator that compares suffixes. This approach uses O(n) space to store the indices but O(n^2 log n) time in the worst case, since each comparison can examine up to O(n) characters. In practice, this method is mainly a teaching example or a fallback for very small inputs.

Efficient algorithms

Over the past decades, several algorithms achieve significantly better performance, bringing construction time down to near-linear in practice and to linear time in the best theoretical models.

Prefix-doubling (Manber–Myers): Start with sorting suffixes by their first character, then repeatedly double the considered prefix length (2^k) and refine the ordering. This yields O(n log n) time and O(n) space, typically implemented with stable sorting or radix sort to remain practical.
DC3 / Skew algorithm: A linear-time construction method that works by partitioning suffixes into residue classes modulo 3 and sorting a reduced problem, then inducing the full order. This yields O(n) time and O(n) space.
SA-IS (Induced Sorting): A highly practical linear-time algorithm that builds the suffix array by classifying characters and using induced sorting to propagate order. It is widely used in libraries for robust performance on large inputs.

In real-world software, SA-IS and the DC3-based approaches are common choices for large alphabets or very large strings, while doubling-based variants with efficient sorting are popular for moderate sizes or simpler implementations. A typical modern pipeline appends a unique terminator (often a character like '$') and then builds the suffix array for the augmented string.

Example For the string S = "banana$", the suffixes are: - "banana$" - "anana$" - "nana$" - "ana$" - "na$" - "a$" - "$"

The corresponding suffix array SA is [6, 5, 3, 1, 0, 4, 2], which lists the starting indices of suffixes in increasing lexicographic order.

For more on related indexing concepts, see text indexing and LCP array.

Applications

Pattern matching

With a suffix array, pattern matching can be performed by binary searching the range of suffixes that have the pattern as a prefix. Each comparison during the search compares the pattern with a suffix in S, and the result bounds the range of matches. The overall search time is O(m log n), where m is the length of the pattern, plus O(k) to report the k matches. If the LCP information is available, constant-factor improvements can be achieved for practical workloads. See also binary search and LCP array.

Text indexing and search

Suffix arrays provide a compact index for large text collections, enabling fast substring queries and support for operations like counting occurrences and locating all occurrences of a pattern. When combined with an LCP array or integrated into a suffix tree, they underpin efficient full-text search engines and document retrieval systems. See text indexing.

Data compression and genomic indexing

A key building block of the Burrows–Wheeler transform is the suffix array; together, they enable powerful preprocessing for compression and indexing. In bioinformatics, suffix arrays (often generalized to support multiple strings) are used to index reference genomes and assist in sequence alignment and variant discovery. See Burrows–Wheeler transform and genome indexing.

General-purpose string processing

Beyond search, suffix arrays support tasks such as repeated substring detection, longest repeated substring, and various formalisms in string algorithms. They also relate to constructions used in pattern matching libraries and text-processing tools.

Variants and related data structures

Suffix tree: A tree-based structure that compactly represents all suffixes of a string and supports similar queries with different performance characteristics. See suffix tree.
LCP array (Longest Common Prefix): Stores the length of the longest common prefix between adjacent suffixes in SA, enabling faster range queries and aiding certain types of pattern searches. See LCP array.
Generalized suffix array: Extends the concept to handle multiple strings simultaneously, useful for comparative genomics and multi-document search. See Generalized suffix array.
Inverse suffix array: The inverse permutation of SA, mapping a starting position to its rank in SA. See Inverse suffix array.
Variants for integer alphabets and external memory: Many practical implementations tailor the approach for specific data characteristics and resource constraints.