Skip to content

Suffix Automaton

Quick Revisit

  • USE WHEN: Counting distinct substrings, longest common substring, or any query over all substrings of a string
  • INVARIANT: Each state = equivalence class of substrings sharing the same endpos set; suffix link points to the longest proper suffix in a different class
  • TIME: O(n) online construction | SPACE: O(n × σ)
  • CLASSIC BUG: Forgetting to clone the state when the existing transition target has len ≠ cur.len + 1 (leads to wrong endpos sets)
  • PRACTICE: 07-ladder-strings

The smallest deterministic finite automaton (DFA) that accepts exactly all suffixes of a string. Its states correspond to equivalence classes of substrings sharing the same set of ending positions, giving a structure with at most 2n1 states and 3n4 transitions -- enough to answer "count distinct substrings," "longest common substring," and many other queries in O(n) time with online construction.

Difficulty: [Advanced]Prerequisites: String Hashing, Suffix Array, TrieCF Rating Relevance: 1900-2100+


Contents

Intuition

"Count the number of distinct substrings of "abab"."

The brute-force plan: generate every substring, dump them into a set, return the size.

a, b, a, b, ab, ba, ab, aba, bab, abab, ...
Set: {a, b, ab, ba, aba, bab, abab}  -->  7

There are n(n+1)2=10 substrings total. After deduplication we get 7 distinct ones. But building the set costs O(n2) substrings, each up to O(n) characters to hash or compare -- O(n3) in the worst case, or O(n2) with hashing. For n=105 that is 1010 operations: far too slow.

We need a structure that captures all substrings without enumerating them one by one.

A suffix automaton is the smallest DFA that accepts exactly all suffixes of a string -- and its states correspond to equivalence classes of substrings, giving O(n) states and transitions.

Every substring of s is a prefix of some suffix. So a DFA that accepts all suffixes implicitly encodes every substring as a path from the initial state. The magic: substrings that always occur at the same set of ending positions are equivalent -- they share a state. This compression reduces potentially O(n2) substrings to at most 2n1 equivalence classes.

For counting distinct substrings, each state v contributes len(v)len(link(v)) new substrings (the lengths it "owns" beyond its suffix-link parent). Sum over all states -- done in O(n).

Visual Walkthrough

Build SAM for "abab" incrementally. Each step adds one character to the end and extends the automaton.

Notation: Each state shows its id, len (longest substring in the class), and link (suffix link). Transitions are labeled with characters.

Step 0 -- Initial state (empty string):

  State 0: len=0, link=-1  (initial)
  last = 0

Step 1 -- Append 'a':

  Create state 1: len=1, link=0
  Add transition 0 --a--> 1
  last = 1

  Automaton:
    (0) --a--> (1)

  Substrings: {a}   count = 1

Step 2 -- Append 'b':

  Create state 2: len=2, link=0
  Add transition 1 --b--> 2
  Walk suffix links from old last=1:
    state 1 has no 'b' -> add 0 --b--> 2
    state 0 has no 'b' -> (already added above, stop at -1)
  last = 2

  Automaton:
    (0) --a--> (1) --b--> (2)
     |                       ^
     +----------b-----------+

  Substrings: {a, b, ab}   count = 3

Step 3 -- Append 'a':

  Create state 3: len=3, link=?
  Add transition 2 --a--> 3
  Walk suffix links from old last=2:
    state 0 has 'a' -> goes to state 1
    Check: len(1) == len(0) + 1?  1 == 0+1? YES
    So link(3) = 1
  last = 3

  Automaton:
    (0) --a--> (1) --b--> (2) --a--> (3)
     |                       ^
     +----------b-----------+

  Suffix links: 3->1, 2->0, 1->0

  Substrings: {a, b, ab, ba, aba}   count = 5

Step 4 -- Append 'b':

  Create state 4: len=4, link=?
  Add transition 3 --b--> 4
  Walk suffix links from old last=3:
    state 1 has 'b' -> goes to state 2
    Check: len(2) == len(1) + 1?  2 == 1+1? YES
    So link(4) = 2
  last = 4

  Automaton:
    (0) --a--> (1) --b--> (2) --a--> (3) --b--> (4)
     |                       ^
     +----------b-----------+

  Suffix links: 4->2, 3->1, 2->0, 1->0

  Substrings: {a, b, ab, ba, aba, bab, abab}   count = 7

Final state diagram:

         a         b         a         b
  (0) -------> (1) -----> (2) -----> (3) -----> (4)
   |                        ^
   +----------b-------------+

  Suffix links (dashed):
  (4) -.-> (2) -.-> (0)
  (3) -.-> (1) -.-> (0)

  State | len | link | Represents (endpos class)
  ------+-----+------+----------------------------
    0   |  0  |  -1  | {empty string}
    1   |  1  |   0  | "a"         endpos={1,3}
    2   |  2  |   0  | "b","ab"    endpos={2,4}
    3   |  3  |   1  | "ba","aba"  endpos={3}
    4   |  4  |   2  | "bab","abab" endpos={4}

  Distinct substrings = sum of (len[v] - len[link[v]]):
    (1-0) + (2-0) + (3-1) + (4-2) = 1 + 2 + 2 + 2 = 7  OK

A second example: "abb" forces a clone

The walkthrough above is clean, but it never triggered cloning -- and cloning is the case that actually decides whether your extend() is correct. Here is the smallest canonical example that does.

  Steps 1-2: append 'a', then 'b'. Identical to the first walkthrough.
    States 0, 1 ("a"), 2 ("b","ab").  last = 2.

  Step 3: append the SECOND 'b'.
    Create state 3 with len = 3.
    Walk suffix links from p = last = 2:
      state 2 has no 'b' transition  -> add 2 --b--> 3.  p = link(2) = 0.
      state 0 already has 'b' -> goes to state 2.  Stop.
    Now we examine q = 2:
      len(q) = 2, len(p) + 1 = 0 + 1 = 1.  THEY DIFFER.
      The transition 0 --b--> 2 was created when "ab" first appeared, with len=2.
      But for the new context (entering on 'b' alone, not via 'a'), the
      transition target should have len = 1 -- not len = 2.
      The state 2 is overloaded: it currently mixes two endpos classes
      that the new character has separated.  So we clone.

      Create state 4 = copy of state 2 (same transitions, same suffix link).
      Set len(4) = 1   (the correct length for the "just b" context).
      Redirect every ancestor's 'b' transition that pointed to 2 to point to 4
      instead -- but only while the transition still goes to 2.  Walking up
      from p = 0:  0 --b--> 2, redirect to 0 --b--> 4.  link(0) = -1, stop.
      Set link(2) = 4 and link(3) = 4.
    last = 3.

  Final automaton ("abb"):
    State | len | link | Represents
    ------+-----+------+------------
      0   |  0  |  -1  | empty
      1   |  1  |   0  | "a"        endpos = {0}
      2   |  2  |   4  | "ab"       endpos = {1}     <- only "ab", not "b"
      3   |  3  |   4  | "abb","bb" endpos = {2}
      4   |  1  |   0  | "b"        endpos = {1, 2}  <- the clone

  Without cloning, state 2 would have claimed endpos = {1, 2}, but
  that would say "ab" ends at position 2 -- false.  The clone splits
  the overloaded state into "b" alone (positions 1, 2) and "ab"
  (position 1 only).

The pattern to remember:

A clone is needed exactly when the existing transition target q has len(q)>len(p)+1. That inequality means q used to carry a longer string in its endpos class -- and the new character has just exposed that the shorter version of q's class has a strictly larger endpos set than the longer version. Splitting them is what cloning does.

Suffixes vs. substrings: two views of the same automaton

The SAM is sometimes described as "the smallest DFA accepting all suffixes of s" and sometimes as "a structure encoding all substrings of s." Both are true, but they are answering different questions:

  • Accepting states represent suffixes. A state is terminal (accepting) if it lies on the suffix-link chain from last after the final character is added. The set of strings spelled along paths from state 0 to a terminal state equals the set of suffixes of s. This is the DFA-theoretic identity.
  • Paths from the start spell substrings. Every distinct substring of s corresponds to exactly one path from state 0, regardless of whether the endpoint is terminal. So "is p a substring of s?" reduces to "does the path labelled p from state 0 exist?" -- you do not need terminal information.

Counting distinct substrings uses the path view (what (len(v)len(link(v))) measures). Walking through the SAM to test substring membership uses the path view too. Terminal states only come up when you specifically need suffixes -- e.g., listing them or checking whether a queried string is a suffix.

  Two views of the same SAM (string "abab"):

  TRANSITION DAG (for matching):     SUFFIX-LINK TREE (for counting):
  Follow edges to spell substrings.  Propagate counts toward root.

       a       b       a       b            State 0 (len=0)
  (0)───>(1)───>(2)───>(3)───>(4)          ╱           ╲
   │              ▲                   State 1(len=1)  State 2(len=2)
   └──────b───────┘                       │               │
                                    State 3(len=3)  State 4(len=4)
  Use DAG for: "is P a substring?"
  Use tree for: "how many times      endpos(4) ⊂ endpos(2) ⊂ endpos(0)
    does P appear?"                   endpos(3) ⊂ endpos(1) ⊂ endpos(0)

The Invariant

Each state represents an equivalence class of substrings that appear at the same set of ending positions (the endpos set). The suffix link of state v points to the longest proper suffix of v's longest string that belongs to a different equivalence class. For every state v: len(link(v))<len(v), and the endpos set of v is a strict subset of the endpos set of link(v).

This nesting of endpos sets along suffix links forms a tree (the "parent tree" or "suffix link tree"). The tree has at most 2n1 nodes -- which is why the automaton has O(n) states.

  Endpos sets nest along suffix links (string "abab"):

  State 4: "abab","bab"  endpos = {3}
       │ suffix link

  State 2: "ab","b"      endpos = {1, 3}
       │ suffix link

  State 0: ""             endpos = {0, 1, 2, 3}

  State 3: "aba","ba"    endpos = {2}
       │ suffix link

  State 1: "a"           endpos = {0, 2}
       │ suffix link

  State 0: ""             endpos = {0, 1, 2, 3}

  Key property: endpos(child) ⊂ endpos(parent)
  This nesting forms a TREE -- the suffix-link tree.
  At most 2n - 1 nodes (states).

What the Code Won't Teach You

States are equivalence classes, not individual substrings. Each state represents all substrings sharing the same endpos set (the positions where they end in the original string). One state can represent many substrings -- specifically all those with lengths in [len(link(v))+1,len(v)]. The compression from O(n2) substrings to O(n) states is the whole point, but the code just manipulates state IDs without revealing what they mean.

Clone states vs. real states matter for counting. During extend(), when a state is cloned, the clone inherits transitions and the suffix link but must NOT inherit the occurrence count. Clones are structural -- they split an equivalence class but don't represent a new suffix endpoint. If you copy cnt into clones, occurrence counts are silently inflated. The code does this correctly, but doesn't explain why.

The suffix-link tree is where the real power lives. Most SAM queries (occurrence counting, endpos set computation, DP over substrings) operate on the suffix-link tree, not on the transition DAG. The transition DAG is for matching (walking a pattern through the automaton). The suffix-link tree is for counting (propagating information from children to parents). Confusing the two leads to wrong DP formulations.

The "aha moment": endpos sets are a partition of reality. The key insight that makes the SAM click is understanding why there are only O(n) states for O(n2) substrings. If two substrings always appear at exactly the same positions in s, they're indistinguishable from the automaton's perspective -- so they collapse into one state. The moment you truly internalize this, the extend() operation stops being arbitrary case analysis and becomes "we're maintaining the endpos partition as we add a character." Every case in extend corresponds to a specific way the partition changes.

Topological order is not optional for DP. A surprisingly common bug: running a DP on the SAM states without processing them in the right order. For propagating counts up the suffix-link tree, you need reverse topological order (children before parents). Sorting states by len in decreasing order gives this. For counting paths through the transition DAG (e.g., k-th substring), you need forward topological order. The SAM has at most 2n states, so sorting by len is O(n) with counting sort -- but forgetting to sort at all gives silently wrong answers that pass small tests.

The SAM subsumes many simpler string structures. The suffix-link tree of the SAM is isomorphic to the suffix tree of the reversed string. The number of distinct substrings, the number of occurrences of any pattern, the longest repeated substring -- all are direct queries on the SAM. Once you're fluent with the SAM, you'll find yourself reaching for it even on problems that "should" use simpler tools, because the unified framework means less context-switching. The danger is overengineering: if KMP or hashing solves the problem in 15 lines, don't build a 60-line SAM out of habit.

When to Reach for This

Trigger phrases in problem statements:

  • "count distinct substrings" SAM, sum len(v)len(link(v)) over all states.
  • "longest common substring of two strings" build SAM on one, walk the other through it.
  • "online string processing" (characters appended one at a time) SAM's incremental construction.
  • "number of occurrences of each substring" SAM + counting terminal paths in the suffix link tree.
  • "smallest cyclic shift" build SAM on s+s, find the lexicographically smallest path of length n.
  • "k-th lexicographically smallest substring" SAM with precomputed path counts.

Counter-examples -- do NOT use SAM when:

  • Single pattern search KMP or Z-function is simpler.
  • Multiple patterns in a dictionary Aho-Corasick is better suited.
  • Only need suffix array + LCP and problem is offline Suffix Array may be simpler.
  • String is short (n5000) brute force or DP may suffice.

C++ STL Reference

Function / TypeHeaderWhat it doesTime Complexity
map<char, int><map>Sparse transition table per state (large alphabet).O(logσ) per access
array<int, 26> or int next[26]<array>Dense transition table (lowercase-only alphabet).O(1) per access
vector<int><vector>Store suffix link tree adjacency, topological order.O(1) amortized push
fill(first, last, val)<algorithm>Initialize transition arrays to 1.O(σ)
reverse(first, last)<algorithm>Reverse topological order for DP on suffix link tree.O(n)
string::push_back(c)<string>Append character during online construction.O(1) amortized
memset(ptr, val, size)<cstring>Fast initialization of transition arrays.O(σ)

Implementation (Contest-Ready)

Version 1: Minimal SAM -- Struct-Based with Dense Transitions

cpp
#include <bits/stdc++.h>
using namespace std;

struct SuffixAutomaton {
    struct State {
        int len, link;
        int next[26];
        State() : len(0), link(-1) { memset(next, -1, sizeof(next)); }
    };

    vector<State> st;
    int last;

    void init() {
        st.clear();
        st.emplace_back(); // state 0: initial
        last = 0;
    }

    void extend(int c) {
        // If transition already exists from last on c (clone case shortcut)
        if (st[last].next[c] != -1) {
            int q = st[last].next[c];
            if (st[q].len == st[last].len + 1) {
                last = q;
                return;
            }
            int clone = st.size();
            st.push_back(st[q]);
            st[clone].len = st[last].len + 1;
            while (last != -1 && st[last].next[c] == q) {
                st[last].next[c] = clone;
                last = st[last].link;
            }
            st[q].link = clone;
            last = clone;
            return;
        }
        int cur = st.size();
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        int p = last;
        while (p != -1 && st[p].next[c] == -1) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[q].len == st[p].len + 1) {
                st[cur].link = q;
            } else {
                int clone = st.size();
                st.push_back(st[q]);       // copy transitions + link
                st[clone].len = st[p].len + 1;
                while (p != -1 && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = clone;
                st[cur].link = clone;
            }
        }
        last = cur;
    }

    void build(const string& s) {
        init();
        st.reserve(2 * s.size());
        for (char c : s) extend(c - 'a');
    }

    // Count distinct substrings
    long long count_distinct() {
        long long ans = 0;
        for (int i = 1; i < (int)st.size(); i++)
            ans += st[i].len - st[st[i].link].len;
        return ans;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    string s;
    cin >> s;

    SuffixAutomaton sam;
    sam.build(s);
    cout << sam.count_distinct() << "\n";
}

Version 2: SAM with Occurrence Counting and Size-Tracking

cpp
#include <bits/stdc++.h>
using namespace std;

struct SAM {
    struct State {
        int len, link;
        long long cnt;         // number of times this class occurs
        int next[26];
        State() : len(0), link(-1), cnt(0) {
            memset(next, -1, sizeof(next));
        }
    };

    vector<State> st;
    int last;

    void init() {
        st.clear();
        st.emplace_back();
        last = 0;
    }

    void extend(int c) {
        if (st[last].next[c] != -1) {
            int q = st[last].next[c];
            if (st[q].len == st[last].len + 1) {
                last = q;
                st[last].cnt++;
                return;
            }
            int clone = st.size();
            st.push_back(st[q]);
            st[clone].len = st[last].len + 1;
            st[clone].cnt = 0;
            while (last != -1 && st[last].next[c] == q) {
                st[last].next[c] = clone;
                last = st[last].link;
            }
            st[q].link = clone;
            last = clone;
            st[last].cnt++;
            return;
        }
        int cur = st.size();
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        st[cur].cnt = 1;
        int p = last;
        while (p != -1 && st[p].next[c] == -1) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[q].len == st[p].len + 1) {
                st[cur].link = q;
            } else {
                int clone = st.size();
                st.push_back(st[q]);
                st[clone].len = st[p].len + 1;
                st[clone].cnt = 0;
                while (p != -1 && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = clone;
                st[cur].link = clone;
            }
        }
        last = cur;
    }

    // Propagate counts up the suffix link tree
    // (topological sort by len, process in reverse order)
    void calc_cnt() {
        int n = st.size();
        vector<int> order(n);
        iota(order.begin(), order.end(), 0);
        sort(order.begin(), order.end(), [&](int a, int b) {
            return st[a].len > st[b].len;
        });
        for (int v : order) {
            if (st[v].link >= 0)
                st[st[v].link].cnt += st[v].cnt;
        }
    }

    void build(const string& s) {
        init();
        st.reserve(2 * s.size());
        for (char c : s) extend(c - 'a');
        calc_cnt();
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    string s;
    cin >> s;

    SAM sam;
    sam.build(s);

    // Find the longest substring occurring at least k times
    int k;
    cin >> k;
    int ans = 0;
    for (int i = 1; i < (int)sam.st.size(); i++)
        if (sam.st[i].cnt >= k)
            ans = max(ans, sam.st[i].len);
    cout << ans << "\n";
}

Version 3: SAM with Map Transitions (Large / Arbitrary Alphabet)

cpp
#include <bits/stdc++.h>
using namespace std;

struct SAM {
    struct State {
        int len, link;
        map<int, int> next;
        State() : len(0), link(-1) {}
    };

    vector<State> st;
    int last;

    void init() {
        st.clear();
        st.emplace_back();
        last = 0;
    }

    void extend(int c) {
        if (st[last].next.count(c)) {
            int q = st[last].next[c];
            if (st[q].len == st[last].len + 1) {
                last = q;
                return;
            }
            int clone = st.size();
            st.push_back(st[q]);
            st[clone].len = st[last].len + 1;
            while (last != -1 && st[last].next.count(c) && st[last].next[c] == q) {
                st[last].next[c] = clone;
                last = st[last].link;
            }
            st[q].link = clone;
            last = clone;
            return;
        }
        int cur = st.size();
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        int p = last;
        while (p != -1 && !st[p].next.count(c)) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[q].len == st[p].len + 1) {
                st[cur].link = q;
            } else {
                int clone = st.size();
                st.push_back(st[q]);
                st[clone].len = st[p].len + 1;
                while (p != -1 && st[p].next.count(c) && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = clone;
                st[cur].link = clone;
            }
        }
        last = cur;
    }
};

Operations Reference

The table below summarizes every common SAM query and its cost -- use it to quickly verify that your approach fits within time and memory limits.

OperationTimeSpace
Build SAM (incremental, alphabet σ)O(n) amortized (dense) / O(nlogσ) (map)O(nσ) (dense) / O(n) (map)
Count distinct substringsO(n) after buildO(1) extra
Count occurrences of each endpos classO(nlogn) topological sort + propagationO(n) extra
Longest common substring (two strings)O(n+m)O(nσ) for first string's SAM
Check if pattern p is a substring$O(p
k-th smallest distinct substringO(n) preprocessing + O(n) per queryO(n) extra
Smallest cyclic shiftO(n)O(nσ) for SAM of s+s
Total occurrences of a pattern$O(p

Problem Patterns & Tricks

Pattern 1: Count Distinct Substrings

Each state v (except the initial state 0) represents an equivalence class whose longest string has length len(v) and whose shortest has length len(link(v))+1. The number of new substrings contributed by v is len(v)len(link(v)).

distinct=v0(len(v)len(link(v)))
cpp
long long count_distinct(const SAM& sam) {
    long long ans = 0;
    for (int i = 1; i < (int)sam.st.size(); i++)
        ans += sam.st[i].len - sam.st[sam.st[i].link].len;
    return ans;
}

CF examples: CF 123D -- String, SPOJ DISUBSTR.


Pattern 2: Longest Common Substring of Two Strings

Build SAM on string s. Walk string t through the automaton, maintaining the current state and the current match length. If a transition exists, increment the length. Otherwise, follow suffix links (shrinking the match) until a transition exists or you reach the initial state.

cpp
int lcs(const SAM& sam, const string& t) {
    int v = 0, cur_len = 0, best = 0;
    for (char ch : t) {
        int c = ch - 'a';
        while (v != 0 && sam.st[v].next[c] == -1) {
            v = sam.st[v].link;
            cur_len = sam.st[v].len;
        }
        if (sam.st[v].next[c] != -1) {
            v = sam.st[v].next[c];
            cur_len++;
        }
        best = max(best, cur_len);
    }
    return best;
}

For k>2 strings: build SAM on the first string, then for each additional string walk it through and record the maximum match length per state. After all strings, take the minimum across strings per state, then the maximum across states. Alternatively, concatenate all strings with distinct separators and build one SAM.

CF examples: SPOJ LCS2 -- Longest Common Substring II, CF 427D -- Match & Catch.


Pattern 3: Occurrence Counting

After building the SAM, propagate counts up the suffix link tree. A state that was "last" after some character was appended gets count 1. Cloned states start with count 0. Propagating in reverse topological order (longest len first) sums descendant counts into ancestor counts.

Then st[v].cnt is the number of times the endpos class of v appears -- equivalently, the number of occurrences of any substring in that class.

cpp
// After build + calc_cnt:
// st[v].cnt = number of occurrences of the longest string in class v
//           = |endpos(v)|

To find the number of occurrences of a specific pattern, walk the pattern through the SAM to find its state, then read st[v].cnt.

CF examples: CF 123D -- String, CF 802I -- Fake News (hard).

  Occurrence count propagation on suffix-link tree:

  String "abab" -- propagate in reverse topological order
  (longest len first -> shortest len last)

  Step 1: Process state 4 (len=4, cnt=1)
           Add cnt to parent state 2:  cnt[2] += 1

  Step 2: Process state 3 (len=3, cnt=1)
           Add cnt to parent state 1:  cnt[1] += 1

  Step 3: Process state 2 (len=2, cnt=1+1=2)
           Add cnt to parent state 0:  cnt[0] += 2

  Step 4: Process state 1 (len=1, cnt=1+1=2)
           Add cnt to parent state 0:  cnt[0] += 2

  Final counts:
  State 0 (""):            cnt = 4  (endpos size = 4 OK)
  State 1 ("a"):           cnt = 2  (appears at pos 0,2 OK)
  State 2 ("b","ab"):      cnt = 2  (appear at pos 1,3 OK)
  State 3 ("ba","aba"):    cnt = 1  (appears at pos 2 OK)
  State 4 ("bab","abab"):  cnt = 1  (appears at pos 3 OK)

Pattern 4: Smallest Cyclic Shift

Concatenate s+s (length 2n), build its SAM. Starting from state 0, greedily follow the lexicographically smallest transition n times. The path spells the smallest cyclic rotation.

cpp
string smallest_cyclic(const string& s) {
    int n = s.size();
    string t = s + s;
    SAM sam;
    sam.init();
    sam.st.reserve(2 * t.size());
    for (char c : t) sam.extend(c - 'a');

    string res;
    int v = 0;
    for (int i = 0; i < n; i++) {
        for (int c = 0; c < 26; c++) {
            if (sam.st[v].next[c] != -1) {
                res += (char)('a' + c);
                v = sam.st[v].next[c];
                break;
            }
        }
    }
    return res;
}

CF examples: CF 235C -- Cyclical Quest (related).


Pattern 5: k-th Smallest Distinct Substring

Precompute for each state the number of distinct substrings reachable via transitions (DFS/DP on the DAG of transitions). Then walk from state 0, at each step choosing the transition whose subtree count brackets k.

cpp
// dp[v] = number of distinct substrings reachable from state v (including v itself)
void calc_dp(const SAM& sam, vector<long long>& dp) {
    int n = sam.st.size();
    dp.assign(n, 0);
    // topological order by transitions (BFS or DFS)
    vector<int> order;
    vector<int> indeg(n, 0);
    for (int v = 0; v < n; v++)
        for (int c = 0; c < 26; c++)
            if (sam.st[v].next[c] != -1)
                indeg[sam.st[v].next[c]]++;
    queue<int> q;
    for (int v = 0; v < n; v++)
        if (indeg[v] == 0) q.push(v);
    while (!q.empty()) {
        int v = q.front(); q.pop();
        order.push_back(v);
        for (int c = 0; c < 26; c++)
            if (sam.st[v].next[c] != -1)
                if (--indeg[sam.st[v].next[c]] == 0)
                    q.push(sam.st[v].next[c]);
    }
    reverse(order.begin(), order.end());
    for (int v : order) {
        dp[v] = 1; // count state v itself (empty string for root, else a substring)
        for (int c = 0; c < 26; c++)
            if (sam.st[v].next[c] != -1)
                dp[v] += dp[sam.st[v].next[c]];
    }
    dp[0]--; // don't count the empty string
}

string kth_substring(const SAM& sam, const vector<long long>& dp, long long k) {
    string res;
    int v = 0;
    while (k > 0) {
        for (int c = 0; c < 26; c++) {
            int u = sam.st[v].next[c];
            if (u == -1) continue;
            if (dp[u] < k) {
                k -= dp[u];
            } else {
                res += (char)('a' + c);
                v = u;
                k--;  // count this node
                break;
            }
        }
    }
    return res;
}

CF examples: CF 128B -- String.


Pattern 6: Substring Matching and Testing

To check whether pattern p is a substring of s, walk p through the SAM. If every character finds a valid transition, p is a substring.

cpp
bool is_substring(const SAM& sam, const string& p) {
    int v = 0;
    for (char ch : p) {
        int c = ch - 'a';
        if (sam.st[v].next[c] == -1) return false;
        v = sam.st[v].next[c];
    }
    return true;
}

This is O(|p|) per query after O(n) construction -- useful when you have many patterns to test against the same text.


Contest Cheat Sheet

+--------------------------------------------------------------+
|               SUFFIX AUTOMATON CHEAT SHEET                    |
+--------------------------------------------------------------+
| WHEN TO USE:                                                  |
|   - Count distinct substrings  --> sum(len[v]-len[link[v]])   |
|   - Longest common substring   --> walk 2nd string on SAM     |
|   - Occurrence count per substr--> propagate cnt up link tree |
|   - Smallest cyclic shift      --> SAM on s+s, greedy walk    |
|   - k-th smallest substring    --> DP on transition DAG       |
|   - Online (append chars)      --> extend() is O(1) amort.   |
+--------------------------------------------------------------+
| CONSTRUCTION:                                                 |
|   SAM sam;                                                    |
|   sam.init();                                                 |
|   for (char c : s) sam.extend(c - 'a');                       |
+--------------------------------------------------------------+
| KEY FORMULAS:                                                 |
|   States:       <= 2n - 1                                     |
|   Transitions:  <= 3n - 4                                     |
|   Distinct subs = sum over v!=0 of (len[v] - len[link[v]])   |
|   Occurrences   = |endpos(v)| = cnt[v] after propagation     |
+--------------------------------------------------------------+
| SUFFIX LINK TREE:                                             |
|   endpos(v) c endpos(link(v))    (strict subset)             |
|   len(link(v)) < len(v)                                       |
|   Children partition parent's endpos set                      |
+--------------------------------------------------------------+
| COMPLEXITY:                                                   |
|   Build:    O(n) amortized (dense alpha)                      |
|   Memory:   O(n * sigma) with arrays, O(n) with map          |
|   Query:    O(|pattern|) for membership                       |
+--------------------------------------------------------------+
| WATCH OUT:                                                    |
|   - Clone states get cnt=0 (they are not "real" endpoints)    |
|   - Propagate cnt in REVERSE topo order (longest len first)   |
|   - SAM has NO sentinel -- unlike suffix array                 |
|   - st[0] is the initial state (empty string), skip in sums  |
+--------------------------------------------------------------+

Gotchas & Debugging

Clone States and Counting

When a state is cloned during extend(), the clone inherits transitions and the suffix link but must not inherit the occurrence count. Clones represent the same equivalence class being split -- only the original "real" extension gets count 1. If you copy cnt into clones, every occurrence count will be inflated.

Forgetting to Propagate Counts

Raw counts after construction only mark states that were last at some step. To get the total number of occurrences of a substring class, you must propagate counts up the suffix link tree in reverse topological order. Without this step, cnt values are 0 for most states.

Transition Array vs Map

Using int next[26] is fast (O(1) per lookup) but burns O(26×2n) memory. For n=2×105 that is about 40 MB -- usually fine but tight. If MLE is a concern or the alphabet is large (digits + letters, or arbitrary integers), switch to map<int,int> at the cost of an O(logσ) factor.

Off-by-One: State 0 in Sums

State 0 is the initial state representing the empty string. When computing distinct substrings, skip state 0 -- it contributes len(0)len(link(0)) which is 0(1)=1 if you use 1 as the sentinel link length, which would incorrectly count the empty string.

Memory: Reserve Capacity

The SAM can have up to 2n1 states. Always st.reserve(2 * n) before building to avoid repeated reallocations that cause TLE and invalidated references.

Verifying Correctness

For small inputs, verify:

  1. State count 2n1.
  2. Transition count 3n4.
  3. For every state v0: len(link(v))<len(v).
  4. The suffix link tree is a valid rooted tree with root at state 0.
  5. Distinct substring count matches brute force.
cpp
// Quick sanity check
void verify(const SAM& sam, const string& s) {
    int n = s.size();
    assert((int)sam.st.size() <= 2 * n - 1 + 1); // +1 for state 0
    for (int v = 1; v < (int)sam.st.size(); v++) {
        assert(sam.st[v].link >= 0);
        assert(sam.st[sam.st[v].link].len < sam.st[v].len);
    }
    // Compare with brute-force distinct count
    set<string> brute;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j <= n; j++)
            brute.insert(s.substr(i, j - i));
    long long sam_cnt = 0;
    for (int i = 1; i < (int)sam.st.size(); i++)
        sam_cnt += sam.st[i].len - sam.st[sam.st[i].link].len;
    assert(sam_cnt == (long long)brute.size());
}

Debugging Tips

  • Print all states with len, link, and transitions for strings of length 8.
  • Test on "aaaa" (all same), "abcd" (all distinct), and "abcbc" (has clones).
  • If WA on occurrence counting, print the suffix link tree and verify endpos sets.

Mental Traps

Trap 1: "The suffix automaton is like a trie of suffixes."

A suffix trie has O(n2) nodes -- one for every prefix of every suffix. The SAM has at most 2n1 states because it merges substrings with identical endpos sets into a single state. Thinking "trie" gives you the wrong mental model for what states represent, why cloning exists, and how counting works.

  Suffix trie of "aba":           Suffix automaton of "aba":
  (every prefix of every suffix)  (endpos equivalence classes)

         root                         (0)──a──>(1)──b──>(2)──a──>(3)
        / | \                          │                  ▲
       a  b  a                         └────b─────────────┘
       |  |  
      ba  a                     States: 4 (vs. trie's ~9 nodes)
      |                         Represents 6 substrings in 4 classes:
     ba                           "a"->{0,2}, "b","ab"->{1}, 
                                  "ba","aba"->{2}, "a" (end)->...
  ~9 nodes for n=3              <= 2n-1 = 5 nodes for n=3

Trap 2: "I'll just paste the template and figure out queries later."

The construction is ~40 lines. But every query requires understanding what states represent. "Count distinct substrings" needs sum(len[v] - len[link[v]]). "Count occurrences" needs propagation on the suffix-link tree. "k-th substring" needs DP on the transition DAG. Without understanding states-as-equivalence-classes, you can't write any of these -- you can only copy them. And copied queries break when the problem has even a slight twist.


Common Mistakes

  1. Forgetting transition redirect after clone. "My SAM had the right number of states but gave wrong distinct-substring counts..." After cloning state q into clone, you must redirect all transitions of ancestors of p that pointed to q (for the current character) to point to clone instead. Without this redirect, some paths bypass the clone and double-count substrings. The SAM invariant: after cloning, every ancestor of p that had a transition to q on character c must now point to clone. Walk up the suffix-link chain from p and redirect until you hit a state that no longer points to q.

Debug Drills

Drill 1 -- Forgetting to redirect transitions after clone

cpp
// Inside sa_extend, after creating clone:
sa[clone] = sa[q];
sa[clone].len = sa[p].len + 1;
sa[q].link = sa[cur].link = clone;
// (missing: redirect transitions)
What's wrong?

After cloning, you must walk up from p and redirect all transitions on character c that point to q to point to clone:

cpp
while (p != -1 && sa[p].next[c] == q) {
    sa[p].next[c] = clone;
    p = sa[p].link;
}

Without this, the automaton has incorrect paths and substring counts are wrong.

Drill 2 -- Wrong topological order for counting

cpp
// Counting occurrences: propagate up suffix links
for (int v = last_created; v >= 0; v--)
    cnt[sa[v].link] += cnt[v];
What's wrong?

Iterating by state index is not reverse topological order of suffix links. States must be sorted by decreasing len value. Use a counting sort on len:

cpp
// Build order[] sorted by decreasing len
vector<int> order(sz);
// ... counting sort by len ...
for (int v : order)
    if (sa[v].link != -1)
        cnt[sa[v].link] += cnt[v];

Drill 3 -- Not resetting last for multi-string SAM

cpp
for (auto& s : strings) {
    for (char c : s)
        sa_extend(c);
    // forgot to reset last = 0
}
What's wrong?

When building a SAM for multiple strings, you must reset last = 0 (initial state) between strings. Otherwise the SAM connects the end of one string to the beginning of the next, creating spurious cross-string substrings. For generalized SAM, also mark terminal states per string.


Self-Test

Before moving to problems, verify you can do these without looking at the notes:

  • [ ] Explain what a single SAM state represents (an equivalence class of substrings with the same endpos set) and what len[v] and len[link[v]] mean for that class
  • [ ] State the maximum number of states (2n1) and transitions (3n4), and explain why O(n2) substrings compress to O(n) states
  • [ ] Explain the difference between a "last" state (gets cnt = 1) and a "clone" state (gets cnt = 0) during construction, and why this matters for occurrence counting
  • [ ] Check if pattern P is a substring of S using the SAM in O(|P|): start at state 0, follow transitions for each character
  • [ ] Compute distinct substrings as v0(len[v]len[link[v]]) and verify on "aba": states have lens 1, 2, 3 with links to lens 0, 0, 1 -> 1+2+2=5

Practice Problems

CF RatingWhat You Should Be Able to Do
1200Understand what a suffix automaton represents (no implementation expected yet).
1500Build a SAM and count distinct substrings using the formula (len[v]len[link[v]]).
1800Use the SAM for longest common substring of two strings (feed second string into SAM of first).
2100Combine SAM with DP for k-th substring, endpos-set propagation, or multi-string generalization.
#ProblemSourceDifficultyKey Idea
1Distinct SubstringsSPOJ1400SAM, count distinct substrings
2Longest Common SubstringSPOJ1500Build SAM on one string, walk the other
3Longest Common Substring IISPOJ1900SAM + walk multiple strings, per-state min-max
4StringCF 123D1900SAM, occurrence counting via suffix links
5Cyclical QuestCF 235C2000SAM on text, rotate query via suffix link walk
6Match & CatchCF 427D2000SAM on concatenated strings, unique occurrence
7StringCF 128B2100SAM, k-th lexicographic substring via DP
8Fake News (hard)CF 802I2300SAM, sum of squares of occurrence counts
9Forbidden IndicesCF 873F2100SAM / SA, endpos-set tracking with forbidden positions
10Anthem of BerlandCF 808G2200SAM + DP, substring matching with wildcards
11Distinct SubstringsCSES1800Count distinct substrings -- classic SAM application
12Substring Order ICSES2000Find k-th lexicographically smallest distinct substring
13Substring Order IICSES2100k-th substring counting multiplicities

Further Reading

Built for the climb from Codeforces 1100 to 2100.