Appearance
Suffix Automaton
Quick Revisit
- USE WHEN: Counting distinct substrings, longest common substring, or any query over all substrings of a string
- INVARIANT: Each state = equivalence class of substrings sharing the same endpos set; suffix link points to the longest proper suffix in a different class
- TIME: O(n) online construction | SPACE: O(n × σ)
- CLASSIC BUG: Forgetting to clone the state when the existing transition target has len ≠ cur.len + 1 (leads to wrong endpos sets)
- PRACTICE: 07-ladder-strings
The smallest deterministic finite automaton (DFA) that accepts exactly all suffixes of a string. Its states correspond to equivalence classes of substrings sharing the same set of ending positions, giving a structure with at most
Difficulty: [Advanced]Prerequisites: String Hashing, Suffix Array, TrieCF Rating Relevance: 1900-2100+
Contents
- Intuition
- C++ STL Reference
- Implementation (Contest-Ready)
- Operations Reference
- Problem Patterns & Tricks
- Contest Cheat Sheet
- Gotchas & Debugging
- Self-Test
- Practice Problems
- Further Reading
Intuition
"Count the number of distinct substrings of "abab"."
The brute-force plan: generate every substring, dump them into a set, return the size.
a, b, a, b, ab, ba, ab, aba, bab, abab, ...
Set: {a, b, ab, ba, aba, bab, abab} --> 7There are
We need a structure that captures all substrings without enumerating them one by one.
A suffix automaton is the smallest DFA that accepts exactly all suffixes of a string -- and its states correspond to equivalence classes of substrings, giving
Every substring of
For counting distinct substrings, each state
Visual Walkthrough
Build SAM for "abab" incrementally. Each step adds one character to the end and extends the automaton.
Notation: Each state shows its id, len (longest substring in the class), and link (suffix link). Transitions are labeled with characters.
Step 0 -- Initial state (empty string):
State 0: len=0, link=-1 (initial)
last = 0Step 1 -- Append 'a':
Create state 1: len=1, link=0
Add transition 0 --a--> 1
last = 1
Automaton:
(0) --a--> (1)
Substrings: {a} count = 1Step 2 -- Append 'b':
Create state 2: len=2, link=0
Add transition 1 --b--> 2
Walk suffix links from old last=1:
state 1 has no 'b' -> add 0 --b--> 2
state 0 has no 'b' -> (already added above, stop at -1)
last = 2
Automaton:
(0) --a--> (1) --b--> (2)
| ^
+----------b-----------+
Substrings: {a, b, ab} count = 3Step 3 -- Append 'a':
Create state 3: len=3, link=?
Add transition 2 --a--> 3
Walk suffix links from old last=2:
state 0 has 'a' -> goes to state 1
Check: len(1) == len(0) + 1? 1 == 0+1? YES
So link(3) = 1
last = 3
Automaton:
(0) --a--> (1) --b--> (2) --a--> (3)
| ^
+----------b-----------+
Suffix links: 3->1, 2->0, 1->0
Substrings: {a, b, ab, ba, aba} count = 5Step 4 -- Append 'b':
Create state 4: len=4, link=?
Add transition 3 --b--> 4
Walk suffix links from old last=3:
state 1 has 'b' -> goes to state 2
Check: len(2) == len(1) + 1? 2 == 1+1? YES
So link(4) = 2
last = 4
Automaton:
(0) --a--> (1) --b--> (2) --a--> (3) --b--> (4)
| ^
+----------b-----------+
Suffix links: 4->2, 3->1, 2->0, 1->0
Substrings: {a, b, ab, ba, aba, bab, abab} count = 7Final state diagram:
a b a b
(0) -------> (1) -----> (2) -----> (3) -----> (4)
| ^
+----------b-------------+
Suffix links (dashed):
(4) -.-> (2) -.-> (0)
(3) -.-> (1) -.-> (0)
State | len | link | Represents (endpos class)
------+-----+------+----------------------------
0 | 0 | -1 | {empty string}
1 | 1 | 0 | "a" endpos={1,3}
2 | 2 | 0 | "b","ab" endpos={2,4}
3 | 3 | 1 | "ba","aba" endpos={3}
4 | 4 | 2 | "bab","abab" endpos={4}
Distinct substrings = sum of (len[v] - len[link[v]]):
(1-0) + (2-0) + (3-1) + (4-2) = 1 + 2 + 2 + 2 = 7 OKA second example: "abb" forces a clone
The walkthrough above is clean, but it never triggered cloning -- and cloning is the case that actually decides whether your extend() is correct. Here is the smallest canonical example that does.
Steps 1-2: append 'a', then 'b'. Identical to the first walkthrough.
States 0, 1 ("a"), 2 ("b","ab"). last = 2.
Step 3: append the SECOND 'b'.
Create state 3 with len = 3.
Walk suffix links from p = last = 2:
state 2 has no 'b' transition -> add 2 --b--> 3. p = link(2) = 0.
state 0 already has 'b' -> goes to state 2. Stop.
Now we examine q = 2:
len(q) = 2, len(p) + 1 = 0 + 1 = 1. THEY DIFFER.
The transition 0 --b--> 2 was created when "ab" first appeared, with len=2.
But for the new context (entering on 'b' alone, not via 'a'), the
transition target should have len = 1 -- not len = 2.
The state 2 is overloaded: it currently mixes two endpos classes
that the new character has separated. So we clone.
Create state 4 = copy of state 2 (same transitions, same suffix link).
Set len(4) = 1 (the correct length for the "just b" context).
Redirect every ancestor's 'b' transition that pointed to 2 to point to 4
instead -- but only while the transition still goes to 2. Walking up
from p = 0: 0 --b--> 2, redirect to 0 --b--> 4. link(0) = -1, stop.
Set link(2) = 4 and link(3) = 4.
last = 3.
Final automaton ("abb"):
State | len | link | Represents
------+-----+------+------------
0 | 0 | -1 | empty
1 | 1 | 0 | "a" endpos = {0}
2 | 2 | 4 | "ab" endpos = {1} <- only "ab", not "b"
3 | 3 | 4 | "abb","bb" endpos = {2}
4 | 1 | 0 | "b" endpos = {1, 2} <- the clone
Without cloning, state 2 would have claimed endpos = {1, 2}, but
that would say "ab" ends at position 2 -- false. The clone splits
the overloaded state into "b" alone (positions 1, 2) and "ab"
(position 1 only).The pattern to remember:
A clone is needed exactly when the existing transition target
has . That inequality means used to carry a longer string in its endpos class -- and the new character has just exposed that the shorter version of 's class has a strictly larger endpos set than the longer version. Splitting them is what cloning does.
Suffixes vs. substrings: two views of the same automaton
The SAM is sometimes described as "the smallest DFA accepting all suffixes of
- Accepting states represent suffixes. A state is terminal (accepting) if it lies on the suffix-link chain from
lastafter the final character is added. The set of strings spelled along paths from state 0 to a terminal state equals the set of suffixes of. This is the DFA-theoretic identity. - Paths from the start spell substrings. Every distinct substring of
corresponds to exactly one path from state 0, regardless of whether the endpoint is terminal. So "is a substring of ?" reduces to "does the path labelled from state 0 exist?" -- you do not need terminal information.
Counting distinct substrings uses the path view (what
Two views of the same SAM (string "abab"):
TRANSITION DAG (for matching): SUFFIX-LINK TREE (for counting):
Follow edges to spell substrings. Propagate counts toward root.
a b a b State 0 (len=0)
(0)───>(1)───>(2)───>(3)───>(4) ╱ ╲
│ ▲ State 1(len=1) State 2(len=2)
└──────b───────┘ │ │
State 3(len=3) State 4(len=4)
Use DAG for: "is P a substring?"
Use tree for: "how many times endpos(4) ⊂ endpos(2) ⊂ endpos(0)
does P appear?" endpos(3) ⊂ endpos(1) ⊂ endpos(0)The Invariant
Each state represents an equivalence class of substrings that appear at the same set of ending positions (the
endposset). The suffix link of statepoints to the longest proper suffix of 's longest string that belongs to a different equivalence class. For every state : , and the endposset ofis a strict subset of the endposset of.
This nesting of endpos sets along suffix links forms a tree (the "parent tree" or "suffix link tree"). The tree has at most
Endpos sets nest along suffix links (string "abab"):
State 4: "abab","bab" endpos = {3}
│ suffix link
▼
State 2: "ab","b" endpos = {1, 3}
│ suffix link
▼
State 0: "" endpos = {0, 1, 2, 3}
State 3: "aba","ba" endpos = {2}
│ suffix link
▼
State 1: "a" endpos = {0, 2}
│ suffix link
▼
State 0: "" endpos = {0, 1, 2, 3}
Key property: endpos(child) ⊂ endpos(parent)
This nesting forms a TREE -- the suffix-link tree.
At most 2n - 1 nodes (states).What the Code Won't Teach You
States are equivalence classes, not individual substrings. Each state represents all substrings sharing the same endpos set (the positions where they end in the original string). One state can represent many substrings -- specifically all those with lengths in
Clone states vs. real states matter for counting. During extend(), when a state is cloned, the clone inherits transitions and the suffix link but must NOT inherit the occurrence count. Clones are structural -- they split an equivalence class but don't represent a new suffix endpoint. If you copy cnt into clones, occurrence counts are silently inflated. The code does this correctly, but doesn't explain why.
The suffix-link tree is where the real power lives. Most SAM queries (occurrence counting, endpos set computation, DP over substrings) operate on the suffix-link tree, not on the transition DAG. The transition DAG is for matching (walking a pattern through the automaton). The suffix-link tree is for counting (propagating information from children to parents). Confusing the two leads to wrong DP formulations.
The "aha moment": endpos sets are a partition of reality. The key insight that makes the SAM click is understanding why there are only extend() operation stops being arbitrary case analysis and becomes "we're maintaining the endpos partition as we add a character." Every case in extend corresponds to a specific way the partition changes.
Topological order is not optional for DP. A surprisingly common bug: running a DP on the SAM states without processing them in the right order. For propagating counts up the suffix-link tree, you need reverse topological order (children before parents). Sorting states by len in decreasing order gives this. For counting paths through the transition DAG (e.g., len is
The SAM subsumes many simpler string structures. The suffix-link tree of the SAM is isomorphic to the suffix tree of the reversed string. The number of distinct substrings, the number of occurrences of any pattern, the longest repeated substring -- all are direct queries on the SAM. Once you're fluent with the SAM, you'll find yourself reaching for it even on problems that "should" use simpler tools, because the unified framework means less context-switching. The danger is overengineering: if KMP or hashing solves the problem in 15 lines, don't build a 60-line SAM out of habit.
When to Reach for This
Trigger phrases in problem statements:
- "count distinct substrings"
SAM, sum over all states. - "longest common substring of two strings"
build SAM on one, walk the other through it. - "online string processing" (characters appended one at a time)
SAM's incremental construction. - "number of occurrences of each substring"
SAM + counting terminal paths in the suffix link tree. - "smallest cyclic shift"
build SAM on , find the lexicographically smallest path of length . - "
-th lexicographically smallest substring" SAM with precomputed path counts.
Counter-examples -- do NOT use SAM when:
- Single pattern search
KMP or Z-function is simpler. - Multiple patterns in a dictionary
Aho-Corasick is better suited. - Only need suffix array + LCP and problem is offline
Suffix Array may be simpler. - String is short (
) brute force or DP may suffice.
C++ STL Reference
| Function / Type | Header | What it does | Time Complexity |
|---|---|---|---|
map<char, int> | <map> | Sparse transition table per state (large alphabet). | |
array<int, 26> or int next[26] | <array> | Dense transition table (lowercase-only alphabet). | |
vector<int> | <vector> | Store suffix link tree adjacency, topological order. | |
fill(first, last, val) | <algorithm> | Initialize transition arrays to | |
reverse(first, last) | <algorithm> | Reverse topological order for DP on suffix link tree. | |
string::push_back(c) | <string> | Append character during online construction. | |
memset(ptr, val, size) | <cstring> | Fast initialization of transition arrays. |
Implementation (Contest-Ready)
Version 1: Minimal SAM -- Struct-Based with Dense Transitions
cpp
#include <bits/stdc++.h>
using namespace std;
struct SuffixAutomaton {
struct State {
int len, link;
int next[26];
State() : len(0), link(-1) { memset(next, -1, sizeof(next)); }
};
vector<State> st;
int last;
void init() {
st.clear();
st.emplace_back(); // state 0: initial
last = 0;
}
void extend(int c) {
// If transition already exists from last on c (clone case shortcut)
if (st[last].next[c] != -1) {
int q = st[last].next[c];
if (st[q].len == st[last].len + 1) {
last = q;
return;
}
int clone = st.size();
st.push_back(st[q]);
st[clone].len = st[last].len + 1;
while (last != -1 && st[last].next[c] == q) {
st[last].next[c] = clone;
last = st[last].link;
}
st[q].link = clone;
last = clone;
return;
}
int cur = st.size();
st.emplace_back();
st[cur].len = st[last].len + 1;
int p = last;
while (p != -1 && st[p].next[c] == -1) {
st[p].next[c] = cur;
p = st[p].link;
}
if (p == -1) {
st[cur].link = 0;
} else {
int q = st[p].next[c];
if (st[q].len == st[p].len + 1) {
st[cur].link = q;
} else {
int clone = st.size();
st.push_back(st[q]); // copy transitions + link
st[clone].len = st[p].len + 1;
while (p != -1 && st[p].next[c] == q) {
st[p].next[c] = clone;
p = st[p].link;
}
st[q].link = clone;
st[cur].link = clone;
}
}
last = cur;
}
void build(const string& s) {
init();
st.reserve(2 * s.size());
for (char c : s) extend(c - 'a');
}
// Count distinct substrings
long long count_distinct() {
long long ans = 0;
for (int i = 1; i < (int)st.size(); i++)
ans += st[i].len - st[st[i].link].len;
return ans;
}
};
int main() {
ios::sync_with_stdio(false);
cin.tie(nullptr);
string s;
cin >> s;
SuffixAutomaton sam;
sam.build(s);
cout << sam.count_distinct() << "\n";
}Version 2: SAM with Occurrence Counting and Size-Tracking
cpp
#include <bits/stdc++.h>
using namespace std;
struct SAM {
struct State {
int len, link;
long long cnt; // number of times this class occurs
int next[26];
State() : len(0), link(-1), cnt(0) {
memset(next, -1, sizeof(next));
}
};
vector<State> st;
int last;
void init() {
st.clear();
st.emplace_back();
last = 0;
}
void extend(int c) {
if (st[last].next[c] != -1) {
int q = st[last].next[c];
if (st[q].len == st[last].len + 1) {
last = q;
st[last].cnt++;
return;
}
int clone = st.size();
st.push_back(st[q]);
st[clone].len = st[last].len + 1;
st[clone].cnt = 0;
while (last != -1 && st[last].next[c] == q) {
st[last].next[c] = clone;
last = st[last].link;
}
st[q].link = clone;
last = clone;
st[last].cnt++;
return;
}
int cur = st.size();
st.emplace_back();
st[cur].len = st[last].len + 1;
st[cur].cnt = 1;
int p = last;
while (p != -1 && st[p].next[c] == -1) {
st[p].next[c] = cur;
p = st[p].link;
}
if (p == -1) {
st[cur].link = 0;
} else {
int q = st[p].next[c];
if (st[q].len == st[p].len + 1) {
st[cur].link = q;
} else {
int clone = st.size();
st.push_back(st[q]);
st[clone].len = st[p].len + 1;
st[clone].cnt = 0;
while (p != -1 && st[p].next[c] == q) {
st[p].next[c] = clone;
p = st[p].link;
}
st[q].link = clone;
st[cur].link = clone;
}
}
last = cur;
}
// Propagate counts up the suffix link tree
// (topological sort by len, process in reverse order)
void calc_cnt() {
int n = st.size();
vector<int> order(n);
iota(order.begin(), order.end(), 0);
sort(order.begin(), order.end(), [&](int a, int b) {
return st[a].len > st[b].len;
});
for (int v : order) {
if (st[v].link >= 0)
st[st[v].link].cnt += st[v].cnt;
}
}
void build(const string& s) {
init();
st.reserve(2 * s.size());
for (char c : s) extend(c - 'a');
calc_cnt();
}
};
int main() {
ios::sync_with_stdio(false);
cin.tie(nullptr);
string s;
cin >> s;
SAM sam;
sam.build(s);
// Find the longest substring occurring at least k times
int k;
cin >> k;
int ans = 0;
for (int i = 1; i < (int)sam.st.size(); i++)
if (sam.st[i].cnt >= k)
ans = max(ans, sam.st[i].len);
cout << ans << "\n";
}Version 3: SAM with Map Transitions (Large / Arbitrary Alphabet)
cpp
#include <bits/stdc++.h>
using namespace std;
struct SAM {
struct State {
int len, link;
map<int, int> next;
State() : len(0), link(-1) {}
};
vector<State> st;
int last;
void init() {
st.clear();
st.emplace_back();
last = 0;
}
void extend(int c) {
if (st[last].next.count(c)) {
int q = st[last].next[c];
if (st[q].len == st[last].len + 1) {
last = q;
return;
}
int clone = st.size();
st.push_back(st[q]);
st[clone].len = st[last].len + 1;
while (last != -1 && st[last].next.count(c) && st[last].next[c] == q) {
st[last].next[c] = clone;
last = st[last].link;
}
st[q].link = clone;
last = clone;
return;
}
int cur = st.size();
st.emplace_back();
st[cur].len = st[last].len + 1;
int p = last;
while (p != -1 && !st[p].next.count(c)) {
st[p].next[c] = cur;
p = st[p].link;
}
if (p == -1) {
st[cur].link = 0;
} else {
int q = st[p].next[c];
if (st[q].len == st[p].len + 1) {
st[cur].link = q;
} else {
int clone = st.size();
st.push_back(st[q]);
st[clone].len = st[p].len + 1;
while (p != -1 && st[p].next.count(c) && st[p].next[c] == q) {
st[p].next[c] = clone;
p = st[p].link;
}
st[q].link = clone;
st[cur].link = clone;
}
}
last = cur;
}
};Operations Reference
The table below summarizes every common SAM query and its cost -- use it to quickly verify that your approach fits within time and memory limits.
| Operation | Time | Space |
|---|---|---|
| Build SAM (incremental, alphabet | ||
| Count distinct substrings | ||
| Count occurrences of each endpos class | ||
| Longest common substring (two strings) | ||
| Check if pattern | $O( | p |
| Smallest cyclic shift | ||
| Total occurrences of a pattern | $O( | p |
Problem Patterns & Tricks
Pattern 1: Count Distinct Substrings
Each state
cpp
long long count_distinct(const SAM& sam) {
long long ans = 0;
for (int i = 1; i < (int)sam.st.size(); i++)
ans += sam.st[i].len - sam.st[sam.st[i].link].len;
return ans;
}CF examples: CF 123D -- String, SPOJ DISUBSTR.
Pattern 2: Longest Common Substring of Two Strings
Build SAM on string
cpp
int lcs(const SAM& sam, const string& t) {
int v = 0, cur_len = 0, best = 0;
for (char ch : t) {
int c = ch - 'a';
while (v != 0 && sam.st[v].next[c] == -1) {
v = sam.st[v].link;
cur_len = sam.st[v].len;
}
if (sam.st[v].next[c] != -1) {
v = sam.st[v].next[c];
cur_len++;
}
best = max(best, cur_len);
}
return best;
}For
CF examples: SPOJ LCS2 -- Longest Common Substring II, CF 427D -- Match & Catch.
Pattern 3: Occurrence Counting
After building the SAM, propagate counts up the suffix link tree. A state that was "last" after some character was appended gets count 1. Cloned states start with count 0. Propagating in reverse topological order (longest
Then st[v].cnt is the number of times the endpos class of
cpp
// After build + calc_cnt:
// st[v].cnt = number of occurrences of the longest string in class v
// = |endpos(v)|To find the number of occurrences of a specific pattern, walk the pattern through the SAM to find its state, then read st[v].cnt.
CF examples: CF 123D -- String, CF 802I -- Fake News (hard).
Occurrence count propagation on suffix-link tree:
String "abab" -- propagate in reverse topological order
(longest len first -> shortest len last)
Step 1: Process state 4 (len=4, cnt=1)
Add cnt to parent state 2: cnt[2] += 1
Step 2: Process state 3 (len=3, cnt=1)
Add cnt to parent state 1: cnt[1] += 1
Step 3: Process state 2 (len=2, cnt=1+1=2)
Add cnt to parent state 0: cnt[0] += 2
Step 4: Process state 1 (len=1, cnt=1+1=2)
Add cnt to parent state 0: cnt[0] += 2
Final counts:
State 0 (""): cnt = 4 (endpos size = 4 OK)
State 1 ("a"): cnt = 2 (appears at pos 0,2 OK)
State 2 ("b","ab"): cnt = 2 (appear at pos 1,3 OK)
State 3 ("ba","aba"): cnt = 1 (appears at pos 2 OK)
State 4 ("bab","abab"): cnt = 1 (appears at pos 3 OK)Pattern 4: Smallest Cyclic Shift
Concatenate
cpp
string smallest_cyclic(const string& s) {
int n = s.size();
string t = s + s;
SAM sam;
sam.init();
sam.st.reserve(2 * t.size());
for (char c : t) sam.extend(c - 'a');
string res;
int v = 0;
for (int i = 0; i < n; i++) {
for (int c = 0; c < 26; c++) {
if (sam.st[v].next[c] != -1) {
res += (char)('a' + c);
v = sam.st[v].next[c];
break;
}
}
}
return res;
}CF examples: CF 235C -- Cyclical Quest (related).
Pattern 5: -th Smallest Distinct Substring
Precompute for each state the number of distinct substrings reachable via transitions (DFS/DP on the DAG of transitions). Then walk from state 0, at each step choosing the transition whose subtree count brackets
cpp
// dp[v] = number of distinct substrings reachable from state v (including v itself)
void calc_dp(const SAM& sam, vector<long long>& dp) {
int n = sam.st.size();
dp.assign(n, 0);
// topological order by transitions (BFS or DFS)
vector<int> order;
vector<int> indeg(n, 0);
for (int v = 0; v < n; v++)
for (int c = 0; c < 26; c++)
if (sam.st[v].next[c] != -1)
indeg[sam.st[v].next[c]]++;
queue<int> q;
for (int v = 0; v < n; v++)
if (indeg[v] == 0) q.push(v);
while (!q.empty()) {
int v = q.front(); q.pop();
order.push_back(v);
for (int c = 0; c < 26; c++)
if (sam.st[v].next[c] != -1)
if (--indeg[sam.st[v].next[c]] == 0)
q.push(sam.st[v].next[c]);
}
reverse(order.begin(), order.end());
for (int v : order) {
dp[v] = 1; // count state v itself (empty string for root, else a substring)
for (int c = 0; c < 26; c++)
if (sam.st[v].next[c] != -1)
dp[v] += dp[sam.st[v].next[c]];
}
dp[0]--; // don't count the empty string
}
string kth_substring(const SAM& sam, const vector<long long>& dp, long long k) {
string res;
int v = 0;
while (k > 0) {
for (int c = 0; c < 26; c++) {
int u = sam.st[v].next[c];
if (u == -1) continue;
if (dp[u] < k) {
k -= dp[u];
} else {
res += (char)('a' + c);
v = u;
k--; // count this node
break;
}
}
}
return res;
}CF examples: CF 128B -- String.
Pattern 6: Substring Matching and Testing
To check whether pattern
cpp
bool is_substring(const SAM& sam, const string& p) {
int v = 0;
for (char ch : p) {
int c = ch - 'a';
if (sam.st[v].next[c] == -1) return false;
v = sam.st[v].next[c];
}
return true;
}This is
Contest Cheat Sheet
+--------------------------------------------------------------+
| SUFFIX AUTOMATON CHEAT SHEET |
+--------------------------------------------------------------+
| WHEN TO USE: |
| - Count distinct substrings --> sum(len[v]-len[link[v]]) |
| - Longest common substring --> walk 2nd string on SAM |
| - Occurrence count per substr--> propagate cnt up link tree |
| - Smallest cyclic shift --> SAM on s+s, greedy walk |
| - k-th smallest substring --> DP on transition DAG |
| - Online (append chars) --> extend() is O(1) amort. |
+--------------------------------------------------------------+
| CONSTRUCTION: |
| SAM sam; |
| sam.init(); |
| for (char c : s) sam.extend(c - 'a'); |
+--------------------------------------------------------------+
| KEY FORMULAS: |
| States: <= 2n - 1 |
| Transitions: <= 3n - 4 |
| Distinct subs = sum over v!=0 of (len[v] - len[link[v]]) |
| Occurrences = |endpos(v)| = cnt[v] after propagation |
+--------------------------------------------------------------+
| SUFFIX LINK TREE: |
| endpos(v) c endpos(link(v)) (strict subset) |
| len(link(v)) < len(v) |
| Children partition parent's endpos set |
+--------------------------------------------------------------+
| COMPLEXITY: |
| Build: O(n) amortized (dense alpha) |
| Memory: O(n * sigma) with arrays, O(n) with map |
| Query: O(|pattern|) for membership |
+--------------------------------------------------------------+
| WATCH OUT: |
| - Clone states get cnt=0 (they are not "real" endpoints) |
| - Propagate cnt in REVERSE topo order (longest len first) |
| - SAM has NO sentinel -- unlike suffix array |
| - st[0] is the initial state (empty string), skip in sums |
+--------------------------------------------------------------+Gotchas & Debugging
Clone States and Counting
When a state is cloned during extend(), the clone inherits transitions and the suffix link but must not inherit the occurrence count. Clones represent the same equivalence class being split -- only the original "real" extension gets count 1. If you copy cnt into clones, every occurrence count will be inflated.
Forgetting to Propagate Counts
Raw counts after construction only mark states that were last at some step. To get the total number of occurrences of a substring class, you must propagate counts up the suffix link tree in reverse topological order. Without this step, cnt values are 0 for most states.
Transition Array vs Map
Using int next[26] is fast (map<int,int> at the cost of an
Off-by-One: State 0 in Sums
State 0 is the initial state representing the empty string. When computing distinct substrings, skip state 0 -- it contributes
Memory: Reserve Capacity
The SAM can have up to st.reserve(2 * n) before building to avoid repeated reallocations that cause TLE and invalidated references.
Verifying Correctness
For small inputs, verify:
- State count
. - Transition count
. - For every state
: . - The suffix link tree is a valid rooted tree with root at state 0.
- Distinct substring count matches brute force.
cpp
// Quick sanity check
void verify(const SAM& sam, const string& s) {
int n = s.size();
assert((int)sam.st.size() <= 2 * n - 1 + 1); // +1 for state 0
for (int v = 1; v < (int)sam.st.size(); v++) {
assert(sam.st[v].link >= 0);
assert(sam.st[sam.st[v].link].len < sam.st[v].len);
}
// Compare with brute-force distinct count
set<string> brute;
for (int i = 0; i < n; i++)
for (int j = i + 1; j <= n; j++)
brute.insert(s.substr(i, j - i));
long long sam_cnt = 0;
for (int i = 1; i < (int)sam.st.size(); i++)
sam_cnt += sam.st[i].len - sam.st[sam.st[i].link].len;
assert(sam_cnt == (long long)brute.size());
}Debugging Tips
- Print all states with
len,link, and transitions for strings of length. - Test on
"aaaa"(all same),"abcd"(all distinct), and"abcbc"(has clones). - If WA on occurrence counting, print the suffix link tree and verify
endpossets.
Mental Traps
Trap 1: "The suffix automaton is like a trie of suffixes."
A suffix trie has endpos sets into a single state. Thinking "trie" gives you the wrong mental model for what states represent, why cloning exists, and how counting works.
Suffix trie of "aba": Suffix automaton of "aba":
(every prefix of every suffix) (endpos equivalence classes)
root (0)──a──>(1)──b──>(2)──a──>(3)
/ | \ │ ▲
a b a └────b─────────────┘
| |
ba a States: 4 (vs. trie's ~9 nodes)
| Represents 6 substrings in 4 classes:
ba "a"->{0,2}, "b","ab"->{1},
"ba","aba"->{2}, "a" (end)->...
~9 nodes for n=3 <= 2n-1 = 5 nodes for n=3Trap 2: "I'll just paste the template and figure out queries later."
The construction is ~40 lines. But every query requires understanding what states represent. "Count distinct substrings" needs sum(len[v] - len[link[v]]). "Count occurrences" needs propagation on the suffix-link tree. "k-th substring" needs DP on the transition DAG. Without understanding states-as-equivalence-classes, you can't write any of these -- you can only copy them. And copied queries break when the problem has even a slight twist.
Common Mistakes
- Forgetting transition redirect after clone. "My SAM had the right number of states but gave wrong distinct-substring counts..." After cloning state
qintoclone, you must redirect all transitions of ancestors ofpthat pointed toq(for the current character) to point tocloneinstead. Without this redirect, some paths bypass the clone and double-count substrings. The SAM invariant: after cloning, every ancestor ofpthat had a transition toqon charactercmust now point toclone. Walk up the suffix-link chain frompand redirect until you hit a state that no longer points toq.
Debug Drills
Drill 1 -- Forgetting to redirect transitions after clone
cpp
// Inside sa_extend, after creating clone:
sa[clone] = sa[q];
sa[clone].len = sa[p].len + 1;
sa[q].link = sa[cur].link = clone;
// (missing: redirect transitions)What's wrong?
After cloning, you must walk up from p and redirect all transitions on character c that point to q to point to clone:
cpp
while (p != -1 && sa[p].next[c] == q) {
sa[p].next[c] = clone;
p = sa[p].link;
}Without this, the automaton has incorrect paths and substring counts are wrong.
Drill 2 -- Wrong topological order for counting
cpp
// Counting occurrences: propagate up suffix links
for (int v = last_created; v >= 0; v--)
cnt[sa[v].link] += cnt[v];What's wrong?
Iterating by state index is not reverse topological order of suffix links. States must be sorted by decreasing len value. Use a counting sort on len:
cpp
// Build order[] sorted by decreasing len
vector<int> order(sz);
// ... counting sort by len ...
for (int v : order)
if (sa[v].link != -1)
cnt[sa[v].link] += cnt[v];Drill 3 -- Not resetting last for multi-string SAM
cpp
for (auto& s : strings) {
for (char c : s)
sa_extend(c);
// forgot to reset last = 0
}What's wrong?
When building a SAM for multiple strings, you must reset last = 0 (initial state) between strings. Otherwise the SAM connects the end of one string to the beginning of the next, creating spurious cross-string substrings. For generalized SAM, also mark terminal states per string.
Self-Test
Before moving to problems, verify you can do these without looking at the notes:
- [ ] Explain what a single SAM state represents (an equivalence class of substrings with the same
endposset) and whatlen[v]andlen[link[v]]mean for that class - [ ] State the maximum number of states (
) and transitions ( ), and explain why substrings compress to states - [ ] Explain the difference between a "last" state (gets
cnt = 1) and a "clone" state (getscnt = 0) during construction, and why this matters for occurrence counting - [ ] Check if pattern
is a substring of using the SAM in : start at state 0, follow transitions for each character - [ ] Compute distinct substrings as
and verify on "aba": states have lens 1, 2, 3 with links to lens 0, 0, 1 ->
Practice Problems
| CF Rating | What You Should Be Able to Do |
|---|---|
| 1200 | Understand what a suffix automaton represents (no implementation expected yet). |
| 1500 | Build a SAM and count distinct substrings using the formula |
| 1800 | Use the SAM for longest common substring of two strings (feed second string into SAM of first). |
| 2100 | Combine SAM with DP for |
| # | Problem | Source | Difficulty | Key Idea |
|---|---|---|---|---|
| 1 | Distinct Substrings | SPOJ | 1400 | SAM, count distinct substrings |
| 2 | Longest Common Substring | SPOJ | 1500 | Build SAM on one string, walk the other |
| 3 | Longest Common Substring II | SPOJ | 1900 | SAM + walk multiple strings, per-state min-max |
| 4 | String | CF 123D | 1900 | SAM, occurrence counting via suffix links |
| 5 | Cyclical Quest | CF 235C | 2000 | SAM on text, rotate query via suffix link walk |
| 6 | Match & Catch | CF 427D | 2000 | SAM on concatenated strings, unique occurrence |
| 7 | String | CF 128B | 2100 | SAM, |
| 8 | Fake News (hard) | CF 802I | 2300 | SAM, sum of squares of occurrence counts |
| 9 | Forbidden Indices | CF 873F | 2100 | SAM / SA, endpos-set tracking with forbidden positions |
| 10 | Anthem of Berland | CF 808G | 2200 | SAM + DP, substring matching with wildcards |
| 11 | Distinct Substrings | CSES | 1800 | Count distinct substrings -- classic SAM application |
| 12 | Substring Order I | CSES | 2000 | Find k-th lexicographically smallest distinct substring |
| 13 | Substring Order II | CSES | 2100 | k-th substring counting multiplicities |
Further Reading
- cp-algorithms: Suffix Automaton -- full theory, proofs of
state and transition bounds, implementation. - CF Blog: A short guide to suffix automata -- concise tutorial with diagrams.
- CF Blog: Suffix Structures -- comparison of suffix array, suffix tree, and suffix automaton.
- CF Blog: SAM applications -- advanced applications: LCS of multiple strings, building suffix tree from SAM.
- See:
04-suffix-array.mdfor an alternative offline structure with comparable power. - See:
06-aho-corasick.mdfor multi-pattern matching (different use case but related automaton construction). - See:
01-string-hashing.mdfor a simpler probabilistic approach to substring problems.