Appearance
Suffix Array
Quick Revisit
- USE WHEN: Distinct substrings, longest repeated substring, pattern matching over all suffixes
- INVARIANT: sa[] is a permutation of [0..n-1] with suffix sa[i] < suffix sa[i+1] lexicographically; LCP array stores shared prefix lengths of adjacent sorted suffixes
- TIME: O(n log n) build, O(m log n) pattern search | SPACE: O(n)
- CLASSIC BUG: Missing sentinel character (
$< all others) so the last suffix sorts correctly- PRACTICE: 07-ladder-strings
Sort every suffix of a string, keep only the starting indices. This
Difficulty: [Advanced]Prerequisites: KMP and Z-Function, Sorting and SearchingCF Rating Relevance: 1900-2200+
Contest Frequency: * Specialized -- appears in advanced string problems
Contents
- Intuition
- C++ STL Reference
- Implementation (Contest-Ready)
- Operations Reference
- Problem Patterns & Tricks
- Contest Cheat Sheet
- Gotchas & Debugging
- Self-Test
- Practice Problems
- Further Reading
Intuition
Take the string "banana". You want to find every occurrence of a pattern, or the longest repeated substring. The brute-force plan: enumerate all
The problem? Comparing two suffixes costs
We need a way to sort suffixes where each comparison is
Sort suffixes by doubling -- first by 1 character, then 2, then 4, ... -- using the previous round's ranks to compare in
Here is the idea built from scratch:
Round 0. Rank every suffix by its first character alone. This is just sorting single characters --
via counting sort, or via comparison sort. Round 1. Now sort by the first 2 characters. But we already know the rank of every suffix by its first character. So the first-2-character key for suffix
is the pair . Comparing two pairs of integers is . Round 2. Sort by the first 4 characters. The key for suffix
is -- the rank of the first-2 block and the rank of the second-2 block (which starts at ). Again per comparison. General round
. Sort by the first characters using the pair . Each round is one sort of items with comparisons. Termination. After
rounds the prefix length exceeds , so we are sorting by the full suffix. Total: rounds sort , or if each round uses radix sort on pairs.
Once the suffix array is built, the LCP array (longest common prefix between each pair of adjacent sorted suffixes) solves "longest repeated substring" in
Visual Walkthrough
String: "banana$" ($ is a sentinel smaller than all letters.)
All suffixes:
i=0: banana$
i=1: anana$
i=2: nana$
i=3: ana$
i=4: na$
i=5: a$
i=6: $Round k=0 -- rank by first 1 character:
char: b a n a n a $
rank: [1, 0, 2, 0, 2, 0, -1]
($ mapped to -1, a=0, b=1, n=2)Round k=1 -- rank by first 2 characters, key = (rank[i], rank[i+1]):
i : pair meaning
0 : (1, 0) "ba"
1 : (0, 2) "an"
2 : (2, 0) "na"
3 : (0, 2) "an"
4 : (2, 0) "na"
5 : (0,-1) "a$"
6 : (-1,*) "$" (* = out-of-bounds -> -inf)
Sort pairs -> re-rank:
rank: [3, 2, 4, 2, 4, 1, 0]
b a n a n a $Suffixes 1 and 3 share rank 2 (both start with "an"), and suffixes 2 and 4 share rank 4 (both start with "na"). Ties remain -- we need more rounds.
Round k=2 -- rank by first 4 characters, key = (rank[i], rank[i+2]):
i : pair meaning
0 : (3, 4) "bana"
1 : (2, 2) "anan"
2 : (4, 4) "nana"
3 : (2, 1) "ana$"
4 : (4, 0) "na$"
5 : (1, *) "a$"
6 : (0, *) "$"
Sort pairs in ascending order, then re-rank:
rank: [4, 3, 6, 2, 5, 1, 0]
b a n a n a $Every pair above is distinct (the second component breaks the round-1 ties on "an" and "na"), so each suffix gets a unique rank. We can stop -- once all
Final suffix array (read off sorted order):
sa[0] = 6 $
sa[1] = 5 a$
sa[2] = 3 ana$
sa[3] = 1 anana$
sa[4] = 0 banana$
sa[5] = 4 na$
sa[6] = 2 nana$That is the suffix array: a permutation of
Past the construction: what the LCP array buys you
After the suffix array is built, the LCP array records the longest common prefix between each pair of adjacent sorted suffixes. The reason that is enough -- the reason we never need LCPs of arbitrary pairs to solve "longest repeated substring" or "count distinct substrings" -- is a neat structural fact:
For any
in sorted order, .
So adjacent LCPs subsume non-adjacent ones via range-minimum queries. That is why "distinct substrings" only needs
Kasai's LCP computation (traverse in text order, maintain height
Sorted suffixes: LCP with predecessor:
sa[0]=6: $ ---
sa[1]=5: a$ lcp[1] = 0
sa[2]=3: ana$ lcp[2] = 1 ("a")
sa[3]=1: anana$ lcp[3] = 3 ("ana")
sa[4]=0: banana$ lcp[4] = 0
sa[5]=4: na$ lcp[5] = 0
sa[6]=2: nana$ lcp[6] = 2 ("na")
lcp[] = { -, 0, 1, 3, 0, 0, 2 }
Longest repeated substring: max(lcp) = 3 --> "ana"
Distinct substrings = 7*6/2 - (0+1+3+0+0+2) = 21 - 6 = 15
(n=7 includes sentinel; formula is n*(n-1)/2 to exclude '$' substrings)Step-by-Step Trace: Kasai's LCP Algorithm for "banana$"
Kasai traverses in text order (i = 0, 1, ..., 6), not SA order. The variable h carries forward: it decreases by at most 1 between steps.
SA: [6, 5, 3, 1, 0, 4, 2]
rank: [4, 3, 6, 2, 5, 1, 0] (inverse of SA: rank[sa[j]] = j)
| Step | i | suffix | rank[i] | SA predecessor | h start | Compare | lcp | h end |
|------|---|-----------|---------|------------------------|---------|----------------------|-----|-------|
| 1 | 0 | banana$ | 4 | sa[3]=1 -> anana$ | 0 | 'b' != 'a' -> stop | 0 | 0 |
| 2 | 1 | anana$ | 3 | sa[2]=3 -> ana$ | 0 | a=a, n=n, a=a OK | 3 | 3 |
| | | | | | | then 'n' != '$' -> stop| | |
| 3 | 2 | nana$ | 6 | sa[5]=4 -> na$ | 3-1=2 | s[4]='n' != '$' ->stop | 2 | 2 |
| 4 | 3 | ana$ | 2 | sa[1]=5 -> a$ | 2-1=1 | s[4]='n' != '$' ->stop | 1 | 1 |
| 5 | 4 | na$ | 5 | sa[4]=0 -> banana$ | 1-1=0 | 'n' != 'b' -> stop | 0 | 0 |
| 6 | 5 | a$ | 1 | sa[0]=6 -> $ | 0 | 'a' != '$' -> stop | 0 | 0 |
| 7 | 6 | $ | 0 | (rank=0, no pred.) | 0 | -- skip -- | -- | 0 |
Result: lcp[] = [ -, 0, 1, 3, 0, 0, 2 ]
Key: h dropped 3->2->1->0 (one per step), never jumped down.
Total extensions across all steps <= 2n, giving Kasai's O(n) bound.The Invariant
After round
, the rank array correctly orders all suffixes by their first characters. Each round reuses the previous ranks: the sort key for suffix is the pair , which encodes characters in .
This is why doubling works: two known halves of length
Prefix-doubling: how each round doubles the comparison window
Round 0 (k=1): rank by first 1 character
┌───┐
│ b │ a n a n a $
└───┘
key = rank[i]
Round 1 (k=2): rank by first 2 characters
┌───┬───┐
│ b │ a │ n a n a $
└───┴───┘
key = (rank[i], rank[i+1])
Round 2 (k=4): rank by first 4 characters
┌───┬───┬───┬───┐
│ b │ a │ n │ a │ n a $
└───┴───┴───┴───┘
key = (rank[i], rank[i+2]) <- reuses round 1 ranks!
Round 3 (k=8 > n): all suffixes fully compared. DONE.
Each round: one sort of n pairs -> O(n log n)
Total rounds: O(log n) -> O(n log² n) overallWhat the Code Won't Teach You
sa[i] and sa[j] is
Kasai's
The sentinel character is load-bearing, not cosmetic. Appending '$' ensures no suffix is a prefix of another. Without it, the ordering of "an" vs "ana" is ambiguous (depending on how you break ties for unequal-length strings), and the prefix-doubling construction produces wrong ranks in intermediate rounds. Every formula involving the LCP array assumes the sentinel exists -- removing it silently corrupts the distinct-substring count and longest-repeated-substring answers.
Think of the suffix array as a sorted phone book for substrings. Every substring of
The "two strings" trick is more general than it looks. Concatenating two strings with a separator (
SA vs. SAM: know when each wins. The suffix array and suffix automaton solve many of the same problems, but their ergonomics differ. SA is better when you need the sorted order of suffixes (lexicographic queries,
When to Reach for This
Trigger phrases in problem statements:
- "longest repeated substring"
build SA + LCP, answer is . - "count distinct substrings"
SA + LCP, answer is . - "lexicographic order of suffixes / rotations"
direct SA application. - "LCP queries between arbitrary suffixes"
SA + LCP + sparse table RMQ. - "longest common substring of two strings"
concatenate with separator, SA + LCP scan.
Counter-examples -- do NOT use suffix array when:
- Single pattern search in a text
use KMP or Z-function in . - Dictionary of multiple patterns
use Aho-Corasick. - Only need hash-based substring comparison
use string hashing. - String is short (
) and problem is simple brute force or DP may suffice.
C++ STL Reference
| Function / Class | Header | What it does | Time Complexity |
|---|---|---|---|
sort(first, last) | <algorithm> | Used in each round of SA construction. | |
stable_sort(first, last, cmp) | <algorithm> | Alternative for stable rank assignment. | |
string::substr(pos, len) | <string> | Extract substring (debugging only -- | |
lower_bound(first, last, val, cmp) | <algorithm> | Binary search on SA for pattern matching. | |
upper_bound(first, last, val, cmp) | <algorithm> | Upper end of pattern match range in SA. | |
iota(first, last, val) | <numeric> | Fill SA with | |
string::compare(pos, len, other, opos, olen) | <string> | Lexicographic comparison of substrings. |
Implementation (Contest-Ready)
sa[i] vs rank[i] -- They're Inverses
sa[i]= which suffix is at position i in sorted order (sa maps rank -> suffix)rank[i]= where suffix starting at position i sits in sorted order (rank maps suffix -> rank)rank[sa[i]] = iandsa[rank[i]] = i-- they're inverse permutations.
Confusing the two is one of the most common suffix array bugs. When you want "the ith suffix in sorted order," use sa[i]. When you want "where does suffix i appear in the sorted order," use rank[i].
The Sentinel Character
Append a character SMALLER than any character in the string (typically $ or \0). This ensures no suffix is a prefix of another suffix, which is required for the suffix array construction to produce a unique, well-defined ordering.
Without the sentinel, suffixes like "ab" and "abc" have ambiguous ordering (is "ab" < "abc"? depends on convention). The sentinel makes "ab
Version 1: Minimal Contest Template -- SA + LCP
cpp
#include <bits/stdc++.h>
using namespace std;
// O(n log^2 n) suffix array construction
vector<int> build_sa(const string& s) {
int n = s.size();
vector<int> sa(n), rank_(n), tmp(n);
iota(sa.begin(), sa.end(), 0);
for (int i = 0; i < n; i++) rank_[i] = s[i];
for (int k = 1; k < n; k <<= 1) {
auto cmp = [&](int a, int b) {
if (rank_[a] != rank_[b]) return rank_[a] < rank_[b];
int ra = a + k < n ? rank_[a + k] : -1;
int rb = b + k < n ? rank_[b + k] : -1;
return ra < rb;
};
sort(sa.begin(), sa.end(), cmp);
tmp[sa[0]] = 0;
for (int i = 1; i < n; i++)
tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
rank_ = tmp;
if (rank_[sa[n - 1]] == n - 1) break;
}
return sa;
}
// O(n) Kasai's algorithm for LCP array
vector<int> build_lcp(const string& s, const vector<int>& sa) {
int n = s.size();
vector<int> rank_(n), lcp(n, 0);
for (int i = 0; i < n; i++) rank_[sa[i]] = i;
int h = 0;
for (int i = 0; i < n; i++) {
if (rank_[i] > 0) {
int j = sa[rank_[i] - 1];
while (i + h < n && j + h < n && s[i + h] == s[j + h]) h++;
lcp[rank_[i]] = h;
if (h > 0) h--;
}
}
return lcp;
}
int main() {
ios::sync_with_stdio(false);
cin.tie(nullptr);
string s;
cin >> s;
s += '$';
vector<int> sa = build_sa(s);
vector<int> lcp = build_lcp(s, sa);
int n = s.size();
long long distinct = (long long)n * (n - 1) / 2;
for (int i = 1; i < n; i++) distinct -= lcp[i];
cout << distinct << "\n";
}Version 2: Explained Version -- SA + LCP with Comments
cpp
#include <bits/stdc++.h>
using namespace std;
// Build suffix array in O(n log^2 n) using prefix-doubling.
// Each round sorts suffixes by their first 2^k characters,
// using the ranking from the previous round as sort keys.
vector<int> build_sa(const string& s) {
int n = s.size();
vector<int> sa(n), rank_(n), tmp(n);
// Initialize: sa[i] = i, rank by single character
iota(sa.begin(), sa.end(), 0);
for (int i = 0; i < n; i++) rank_[i] = s[i];
for (int k = 1; k < n; k <<= 1) {
// Compare suffix i vs suffix j by the pair:
// (rank[i], rank[i+k]) vs (rank[j], rank[j+k])
// This effectively compares the first 2k characters.
auto cmp = [&](int a, int b) {
if (rank_[a] != rank_[b]) return rank_[a] < rank_[b];
// If i+k is out of bounds, that suffix is shorter -> rank = -1
int ra = a + k < n ? rank_[a + k] : -1;
int rb = b + k < n ? rank_[b + k] : -1;
return ra < rb;
};
sort(sa.begin(), sa.end(), cmp);
// Re-assign ranks based on new sorted order
// Equal pairs get the same rank
tmp[sa[0]] = 0;
for (int i = 1; i < n; i++)
tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
rank_ = tmp;
// Early termination: all ranks are unique
if (rank_[sa[n - 1]] == n - 1) break;
}
return sa;
}
// Kasai's algorithm: compute LCP array in O(n).
// Key insight: if suffix at position i has lcp[rank[i]] = h with
// its predecessor in SA order, then the suffix at position i+1
// has lcp >= h-1 with its predecessor.
// We iterate in text order (i = 0..n-1) and maintain h.
vector<int> build_lcp(const string& s, const vector<int>& sa) {
int n = s.size();
vector<int> rank_(n), lcp(n, 0);
// Build rank (inverse of sa): rank[sa[i]] = i
for (int i = 0; i < n; i++) rank_[sa[i]] = i;
int h = 0; // current LCP length, can only decrease by 1 each step
for (int i = 0; i < n; i++) {
if (rank_[i] > 0) {
// j = suffix that comes just before suffix i in sorted order
int j = sa[rank_[i] - 1];
// Extend match from current h
while (i + h < n && j + h < n && s[i + h] == s[j + h]) h++;
lcp[rank_[i]] = h;
// Next iteration: h decreases by at most 1
if (h > 0) h--;
}
}
return lcp;
}
int main() {
ios::sync_with_stdio(false);
cin.tie(nullptr);
string s;
cin >> s;
// Append sentinel: must be < all other characters in s.
// This ensures no suffix is a prefix of another suffix,
// which simplifies comparisons.
s += '$';
vector<int> sa = build_sa(s);
vector<int> lcp = build_lcp(s, sa);
int n = s.size();
// Number of distinct substrings:
// Total substrings = n*(n-1)/2 (excluding sentinel suffixes)
// Each lcp[i] accounts for lcp[i] substrings already counted
long long distinct = (long long)n * (n - 1) / 2;
for (int i = 1; i < n; i++) distinct -= lcp[i];
cout << distinct << "\n";
}Operations Reference
The suffix array is a build-once structure -- construction dominates the cost, but every query afterward is fast. This table shows why the upfront investment pays off.
| Operation | Time | Space |
|---|---|---|
| Build suffix array ( | ||
| Build suffix array ( | ||
| Build LCP array (Kasai) | ||
| Count distinct substrings | ||
| Longest repeated substring | ||
| Pattern search (single pattern of length | ||
| LCP of arbitrary pair | ||
| Longest common substring (two strings) |
Problem Patterns & Tricks
Pattern 1: Number of Distinct Substrings
The total number of substrings is
where
cpp
long long ans = 0;
for (int i = 1; i < n; i++)
ans += (sa[i] == 0 ? n : n - sa[i]) - lcp[i];
// each suffix sa[i] contributes (n - sa[i]) substrings minus lcp[i] duplicatesCF examples: CF 802I -- Fake News (hard), SPOJ DISUBSTR.
Pattern 2: Longest Repeated Substring
The longest substring that appears at least twice is the maximum value in the LCP array. To find it, scan
cpp
int best = *max_element(lcp.begin() + 1, lcp.end());
// The repeated substring is s.substr(sa[argmax], best)For "longest substring appearing at least
CF examples: CF 1043G -- Speckled Band, SPOJ REPEATS.
Pattern 3: Pattern Matching via Binary Search on SA
All occurrences of pattern
cpp
// Find range [lo, hi) in sa where s[sa[i]..] starts with pattern p
auto cmp_lo = [&](int sa_i, const string& p) {
return s.compare(sa_i, p.size(), p) < 0;
};
auto cmp_hi = [&](const string& p, int sa_i) {
return s.compare(sa_i, p.size(), p) > 0;
};
int lo = lower_bound(sa.begin(), sa.end(), p, cmp_lo) - sa.begin();
int hi = upper_bound(sa.begin(), sa.end(), p, cmp_hi) - sa.begin();
int count = hi - lo; // number of occurrencesCF examples: CF 149E -- Martian Strings.
Pattern 4: Longest Common Substring of Two Strings
Concatenate
cpp
string t = s1 + '#' + s2 + '$';
auto sa = build_sa(t);
auto lcp = build_lcp(t, sa);
int sep = s1.size(); // index of '#'
int ans = 0;
for (int i = 1; i < (int)t.size(); i++) {
bool from1_prev = sa[i - 1] < sep;
bool from1_curr = sa[i] < sep;
if (from1_prev != from1_curr)
ans = max(ans, lcp[i]);
}CF examples: CF 1526E -- Oolimry, SPOJ LCS.
Longest common substring via concatenation + SA + LCP:
s1 = "abcde" s2 = "bcdef"
Concatenated: "abcde#bcdef$"
├─s1──┤ ├─s2──┤
0 4 5 6 10 11
After SA + LCP construction, scan adjacent pairs:
sa[i-1] sa[i] From LCP
──────── ──────── ─────── ────
...
1 6 s1, s2 4 <- "bcde" in both!
...
Answer: max LCP where sa[i-1] and sa[i] come from
different strings = 4 -> "bcde"
Separator '#' ensures no false cross-boundary matches.Pattern 5: SA + Sparse Table RMQ for Arbitrary LCP Queries
To find the LCP of any two suffixes
cpp
// Sparse table over lcp[] for O(1) RMQ
vector<vector<int>> sp(LOG, vector<int>(n));
sp[0] = lcp;
for (int j = 1; j < LOG; j++)
for (int i = 0; i + (1 << j) <= n; i++)
sp[j][i] = min(sp[j-1][i], sp[j-1][i + (1 << (j-1))]);
auto query = [&](int l, int r) { // LCP of sa[l] and sa[r], l < r
l++; // lcp is between consecutive, so range is [l+1..r]
int k = __lg(r - l + 1);
return min(sp[k][l], sp[k][r - (1 << k) + 1]);
};CF examples: CF 873F -- Forbidden Indices.
Pattern 6: Counting Substrings with Property via SA + Binary Search
Many problems ask "how many substrings satisfy X?" where X depends on length or lexicographic order. Group suffixes by LCP ranges. Binary search on the LCP array to find how far a common prefix extends. Often combined with monotonic stack on LCP values.
CF examples: CF 432D -- Prefixes and Suffixes, CF 427D -- Match & Catch.
Contest Cheat Sheet
+--------------------------------------------------------------+
| SUFFIX ARRAY CHEAT SHEET |
+--------------------------------------------------------------+
| WHEN TO USE: |
| - Count distinct substrings --> SA + LCP sum |
| - Longest repeated substring --> max(lcp[]) |
| - Pattern matching (multi-query) --> binary search on SA |
| - Longest common substring (2 str) --> concat + SA + LCP |
| - Any "all substrings" query --> SA + LCP + RMQ |
+--------------------------------------------------------------+
| CONSTRUCTION: |
| s += '$'; // sentinel, smaller than all chars |
| sa = build_sa(s); // O(n log^2 n) or O(n log n) |
| lcp = build_lcp(s, sa); // Kasai, O(n) |
+--------------------------------------------------------------+
| KEY FORMULAS: |
| distinct substrings = n*(n-1)/2 - sum(lcp[1..n-1]) |
| longest repeated = max(lcp[1..n-1]) |
| occurrences of P = upper_bound - lower_bound on SA |
+--------------------------------------------------------------+
| COMPLEXITY: |
| Build SA: O(n log^2 n) time, O(n) space |
| Build LCP: O(n) time, O(n) space |
| Pattern: O(m log n) per query |
| RMQ on LCP: O(n log n) build, O(1) per query |
+--------------------------------------------------------------+
| WATCH OUT: |
| - Append sentinel '$' before building |
| - lcp[0] is undefined (no predecessor) |
| - rank[] is INVERSE of sa[]: rank[sa[i]] = i |
+--------------------------------------------------------------+Gotchas & Debugging
Forgetting the Sentinel
Without a sentinel character smaller than all characters in '$' (or character 0) and adjust your index arithmetic accordingly.
Off-by-One in LCP
lcp[i] is the LCP between sa[i-1] and sa[i]. There is no lcp[0] (no predecessor). Loops should start at i = 1.
Rank vs SA Confusion
sa[i]= starting index of the-th suffix in sorted order. rank[i]= position of suffix starting at indexin the sorted order. - They are inverses:
rank[sa[i]] = iandsa[rank[i]] = i.
Mixing them up causes silent wrong answers. Print both arrays during debugging.
Sentinel in Distinct Substring Count
If you appended '$', the sentinel contributes extra suffixes. The formula '$'. Adjust: use original string length, or subtract suffixes containing the sentinel.
Integer Overflow
For strings of length int. Use long long.
Debugging Tips
- Print the sorted suffixes alongside the SA and LCP arrays for small inputs.
- Verify
rank[sa[i]] == ifor all. - Verify
lcp[i] >= 0andlcp[i] < nfor all valid. - Test on
"aaaa"(all same characters) and"abcde"(all distinct) as corner cases.
Mental Traps
Trap 1: "Suffix array is just sorting suffixes -- why is it hard?"
Sorting
Naive sort vs. prefix-doubling:
Naive:
┌─────────────────────────────────────┐
│ n suffixes x O(n) per comparison │
│ x O(n log n) comparisons │
│ = O(n² log n) │ <- TLE for n >= 10⁴
└─────────────────────────────────────┘
Prefix-doubling:
┌─────────────────────────────────────┐
│ O(log n) rounds │
│ x O(n log n) sort per round │
│ x O(1) per comparison (rank pairs) │
│ = O(n log² n) │ <- OK for n <= 10⁵
└─────────────────────────────────────┘Trap 2: "I'll compute LCP on-the-fly instead of building the array."
For a single pair, computing LCP by character comparison is fine:
Common Mistakes
- Kasai iteration order. "My LCP array had garbage values for some positions, and distinct-substring count was way off..." In Kasai's algorithm, iterating in suffix-array order (
for i in sa[0], sa[1], ...) instead of text order (for i in 0, 1, ..., n-1) breaks the amortization. Kasai's algorithm exploits the fact thatlcp[rank[i]] >= lcp[rank[i-1]] - 1-- this inequality only holds when iterating by text position. Iterating in SA order produces wrong values.
Debug Drills
Drill 1 -- Kasai's algorithm: wrong iteration order
cpp
int k = 0;
for (int i = 0; i < n; i++) {
int j = sa[rank[i] - 1]; // BUG: no bounds check
while (s[i + k] == s[j + k]) k++;
lcp[rank[i]] = k;
if (k) k--;
}What's wrong?
When rank[i] == 0, rank[i] - 1 = -1 is out of bounds. The first suffix in sorted order has no predecessor, so its LCP is 0. Add a guard:
cpp
if (rank[i] == 0) { k = 0; continue; }Drill 2 -- Suffix array sort: not breaking ties correctly
cpp
// Doubling step: sort by (rank[i], rank[i + w])
sort(sa.begin(), sa.end(), [&](int a, int b) {
return rank[a] < rank[b] ||
(rank[a] == rank[b] && rank[a + w] < rank[b + w]); // BUG
});What's wrong?
When a + w >= n or b + w >= n, rank[a + w] is out of bounds. Suffixes that are shorter than w remaining characters should be treated as having rank -1 (or INT_MIN):
cpp
auto second = [&](int x) { return x + w < n ? rank[x + w] : -1; };
sort(sa.begin(), sa.end(), [&](int a, int b) {
return rank[a] < rank[b] || (rank[a] == rank[b] && second(a) < second(b));
});Drill 3 -- Distinct substring count: off-by-one with LCP sum
cpp
long long total = (long long)n * (n + 1) / 2;
for (int i = 0; i < n; i++) // BUG: LCP[0] is usually 0 or undefined
total -= lcp[i];What's wrong?
Depending on convention, lcp[0] might be undefined (it's the LCP between the "suffix before the first" and the first -- which doesn't exist). Typically lcp[0] = 0 by definition, so this may or may not be a bug. But if your LCP array is 1-indexed (common in some templates), you should iterate from i = 1. Verify your template's convention.
Self-Test
Before moving to problems, verify you can do these without looking at the notes:
- [ ] Implement
SA construction from memory: initialize ranks from characters, iterate doubling rounds, sort by (rank[i], rank[i+k])pairs, re-rank, early-terminate when all ranks are distinct - [ ] Implement Kasai's LCP algorithm from memory, including the inverse-SA computation and the
h--step - [ ] State the inverse relationship
rank[sa[i]] = iand verify on"banana$"-- explain what bug arises from confusingsaandrank - [ ] Compute distinct substrings from SA + LCP:
, and explain why lcp[0]is excluded - [ ] Describe how to find all occurrences of pattern
in string via binary search on the suffix array, and state the per-query complexity
Practice Problems
| CF Rating | What You Should Be Able to Do |
|---|---|
| 1200 | Understand what a suffix array is conceptually; sort suffixes by brute force. |
| 1500 | Implement |
| 1800 | Use SA + LCP + RMQ for pattern search, counting distinct substrings, and longest common substring of two strings. |
| 2100 | Combine SA + LCP with monotonic stack, binary search, or segment tree for advanced problems (e.g., sum of distinct substring lengths, LCP-based grouping). |
| # | Problem | Source | Difficulty | Key Idea |
|---|---|---|---|---|
| 1 | Distinct Substrings | SPOJ | 1400 | SA + LCP, count distinct substrings |
| 2 | Longest Common Substring | SPOJ | 1500 | Concatenate two strings, SA + LCP scan |
| 3 | String Matching | CSES | 1500 | Binary search on SA for pattern |
| 4 | Prefixes and Suffixes | CF 432D | 1900 | SA + LCP + counting |
| 5 | Match & Catch | CF 427D | 2000 | SA + LCP on concatenated strings |
| 6 | Forbidden Indices | CF 873F | 2100 | SA + LCP + sparse table RMQ |
| 7 | Martian Strings | CF 149E | 2300 | SA + LCP, substring search |
| 8 | Repeats | SPOJ | 2200 | SA + LCP + binary search on length |
| 9 | Speckled Band | CF 1043G | 2400 | SA + LCP, heavy case analysis |
| 10 | Fake News (hard) | CF 802I | 2300 | SA + LCP + monotonic stack |
Further Reading
- cp-algorithms: Suffix Array --
construction with radix sort, applications, and LCP. - cp-algorithms: Kasai's Algorithm -- detailed proof of the
LCP construction. - CF Blog: Suffix Array and LCP -- tutorial with worked examples.
- CF Blog: Suffix Structures -- comparison of suffix array vs suffix tree vs suffix automaton.
- See:
02-kmp-and-z-function.mdfor simpler string matching when SA is overkill. - See:
01-string-hashing.mdfor probabilistic substring comparison as a lighter alternative. - See:
12-sparse-table-rmq.mdfor the RMQ structure often paired with LCP arrays.