Suffix Array

Quick Revisit
USE WHEN: Distinct substrings, longest repeated substring, pattern matching over all suffixes
INVARIANT: sa[] is a permutation of [0..n-1] with suffix sa[i] < suffix sa[i+1] lexicographically; LCP array stores shared prefix lengths of adjacent sorted suffixes
TIME: O(n log n) build, O(m log n) pattern search | SPACE: O(n)
CLASSIC BUG: Missing sentinel character ($ < all others) so the last suffix sorts correctly
PRACTICE: 07-ladder-strings

Sort every suffix of a string, keep only the starting indices. This $O (n \log n)$ structure answers substring queries -- distinct substrings, longest repeated substring, pattern matching -- that would otherwise require a suffix tree.

Difficulty: [Advanced]Prerequisites: KMP and Z-Function, Sorting and SearchingCF Rating Relevance: 1900-2200+

Contest Frequency: * Specialized -- appears in advanced string problems

Intuition
C++ STL Reference
Implementation (Contest-Ready)
Operations Reference
Problem Patterns & Tricks
Contest Cheat Sheet
Gotchas & Debugging
Self-Test
Practice Problems
Further Reading

Intuition

Take the string "banana". You want to find every occurrence of a pattern, or the longest repeated substring. The brute-force plan: enumerate all $n$ suffixes, sort them lexicographically, then scan for answers.

The problem? Comparing two suffixes costs $O (n)$ in the worst case, and sorting $n$ items with comparison sort is $O (n \log n)$ comparisons, so naive construction is $O (n^{2} \log n)$ . For $n = 10^{5}$ that is roughly $10^{11.7}$ character comparisons -- way too slow.

We need a way to sort suffixes where each comparison is $O (1)$ , not $O (n)$ .

Sort suffixes by doubling -- first by 1 character, then 2, then 4, ... -- using the previous round's ranks to compare in $O (1)$ , giving $O (n \log n)$ total.

Here is the idea built from scratch:

Round 0. Rank every suffix by its first character alone. This is just sorting single characters -- $O (n)$ via counting sort, or $O (n \log n)$ via comparison sort.
Round 1. Now sort by the first 2 characters. But we already know the rank of every suffix by its first character. So the first-2-character key for suffix $i$ is the pair $(rank [i], rank [i + 1])$ . Comparing two pairs of integers is $O (1)$ .
Round 2. Sort by the first 4 characters. The key for suffix $i$ is $(rank [i], rank [i + 2])$ -- the rank of the first-2 block and the rank of the second-2 block (which starts at $i + 2$ ). Again $O (1)$ per comparison.
General round $k$ . Sort by the first $2^{k}$ characters using the pair $(rank [i], rank [i + 2^{k - 1}])$ . Each round is one sort of $n$ items with $O (1)$ comparisons.
Termination. After $⌈ \log_{2} n ⌉$ rounds the prefix length exceeds $n$ , so we are sorting by the full suffix. Total: $O (\log n)$ rounds $\times$ $O (n \log n)$ sort $= O (n \log^{2} n)$ , or $O (n \log n)$ if each round uses radix sort on pairs.

Once the suffix array is built, the LCP array (longest common prefix between each pair of adjacent sorted suffixes) solves "longest repeated substring" in $O (n)$ via Kasai's algorithm, which exploits the identity $lcp [i] \geq lcp [i - 1] - 1$ when traversing in text order.

Visual Walkthrough

String: "banana$" ( $n = 7$ , $ is a sentinel smaller than all letters.)

All suffixes:

  i=0: banana$
  i=1: anana$
  i=2: nana$
  i=3: ana$
  i=4: na$
  i=5: a$
  i=6: $

Round k=0 -- rank by first 1 character:

  char:  b   a   n   a   n   a   $
  rank: [1,  0,  2,  0,  2,  0, -1]
        ($ mapped to -1, a=0, b=1, n=2)

Round k=1 -- rank by first 2 characters, key = (rank[i], rank[i+1]):

  i : pair            meaning
  0 : (1, 0)          "ba"
  1 : (0, 2)          "an"
  2 : (2, 0)          "na"
  3 : (0, 2)          "an"
  4 : (2, 0)          "na"
  5 : (0,-1)          "a$"
  6 : (-1,*)          "$"   (* = out-of-bounds -> -inf)

  Sort pairs -> re-rank:
  rank: [3,  2,  4,  2,  4,  1,  0]
         b   a   n   a   n   a   $

Suffixes 1 and 3 share rank 2 (both start with "an"), and suffixes 2 and 4 share rank 4 (both start with "na"). Ties remain -- we need more rounds.

Round k=2 -- rank by first 4 characters, key = (rank[i], rank[i+2]):

  i : pair            meaning
  0 : (3, 4)          "bana"
  1 : (2, 2)          "anan"
  2 : (4, 4)          "nana"
  3 : (2, 1)          "ana$"
  4 : (4, 0)          "na$"
  5 : (1, *)          "a$"
  6 : (0, *)          "$"

  Sort pairs in ascending order, then re-rank:
  rank: [4,  3,  6,  2,  5,  1,  0]
         b   a   n   a   n   a   $

Every pair above is distinct (the second component breaks the round-1 ties on "an" and "na"), so each suffix gets a unique rank. We can stop -- once all $n$ ranks are distinct, the sort is complete.

Final suffix array (read off sorted order):

  sa[0] = 6   $
  sa[1] = 5   a$
  sa[2] = 3   ana$
  sa[3] = 1   anana$
  sa[4] = 0   banana$
  sa[5] = 4   na$
  sa[6] = 2   nana$

That is the suffix array: a permutation of $[0, n)$ ordering suffixes lexicographically. By itself it already solves "kth smallest substring" and "binary-search a pattern in $O (m \log n)$ ." Most of the famous applications, though, need one more array.

Past the construction: what the LCP array buys you

After the suffix array is built, the LCP array records the longest common prefix between each pair of adjacent sorted suffixes. The reason that is enough -- the reason we never need LCPs of arbitrary pairs to solve "longest repeated substring" or "count distinct substrings" -- is a neat structural fact:

For any $i < j$ in sorted order, $lcp (sa [i], sa [j]) = min (lcp [i + 1], \dots, lcp [j])$ .

So adjacent LCPs subsume non-adjacent ones via range-minimum queries. That is why "distinct substrings" only needs $\sum lcp [i]$ and "longest repeated substring" only needs $max lcp [i]$ .

Kasai's LCP computation (traverse in text order, maintain height $h$ ):

  Sorted suffixes:             LCP with predecessor:
  sa[0]=6:  $                  ---
  sa[1]=5:  a$                 lcp[1] = 0
  sa[2]=3:  ana$               lcp[2] = 1   ("a")
  sa[3]=1:  anana$             lcp[3] = 3   ("ana")
  sa[4]=0:  banana$            lcp[4] = 0
  sa[5]=4:  na$                lcp[5] = 0
  sa[6]=2:  nana$              lcp[6] = 2   ("na")

  lcp[] = { -, 0, 1, 3, 0, 0, 2 }

  Longest repeated substring: max(lcp) = 3  -->  "ana"
  Distinct substrings = 7*6/2 - (0+1+3+0+0+2) = 21 - 6 = 15
  (n=7 includes sentinel; formula is n*(n-1)/2 to exclude '$' substrings)

Step-by-Step Trace: Kasai's LCP Algorithm for `"banana$"`

Kasai traverses in text order (i = 0, 1, ..., 6), not SA order. The variable h carries forward: it decreases by at most 1 between steps.

  SA:   [6, 5, 3, 1, 0, 4, 2]
  rank: [4, 3, 6, 2, 5, 1, 0]     (inverse of SA: rank[sa[j]] = j)

  | Step | i | suffix    | rank[i] | SA predecessor         | h start | Compare              | lcp | h end |
  |------|---|-----------|---------|------------------------|---------|----------------------|-----|-------|
  |  1   | 0 | banana$   |    4    | sa[3]=1 -> anana$       |    0    | 'b' != 'a' -> stop    |  0  |   0   |
  |  2   | 1 | anana$    |    3    | sa[2]=3 -> ana$         |    0    | a=a, n=n, a=a OK      |  3  |   3   |
  |      |   |           |         |                        |         | then 'n' != '$' -> stop|     |       |
  |  3   | 2 | nana$     |    6    | sa[5]=4 -> na$          |  3-1=2  | s[4]='n' != '$' ->stop |  2  |   2   |
  |  4   | 3 | ana$      |    2    | sa[1]=5 -> a$           |  2-1=1  | s[4]='n' != '$' ->stop |  1  |   1   |
  |  5   | 4 | na$       |    5    | sa[4]=0 -> banana$      |  1-1=0  | 'n' != 'b' -> stop    |  0  |   0   |
  |  6   | 5 | a$        |    1    | sa[0]=6 -> $            |    0    | 'a' != '$' -> stop    |  0  |   0   |
  |  7   | 6 | $         |    0    | (rank=0, no pred.)     |    0    | -- skip --             |  --  |   0   |

  Result: lcp[] = [ -, 0, 1, 3, 0, 0, 2 ]

  Key: h dropped 3->2->1->0 (one per step), never jumped down.
  Total extensions across all steps <= 2n, giving Kasai's O(n) bound.

The Invariant

After round $k$ , the rank array correctly orders all suffixes by their first $2^{k}$ characters. Each round reuses the previous ranks: the sort key for suffix $i$ is the pair $(rank [i], rank [i + 2^{k - 1}])$ , which encodes $2^{k}$ characters in $O (1)$ .

This is why doubling works: two known halves of length $2^{k - 1}$ are combined into a single key of length $2^{k}$ with no extra character comparisons. The number of distinct ranks can only increase or stay the same each round, and once all $n$ ranks are unique the suffix array is complete.

  Prefix-doubling: how each round doubles the comparison window

  Round 0 (k=1): rank by first 1 character
  ┌───┐
  │ b │ a  n  a  n  a  $
  └───┘
  key = rank[i]

  Round 1 (k=2): rank by first 2 characters
  ┌───┬───┐
  │ b │ a │ n  a  n  a  $
  └───┴───┘
  key = (rank[i], rank[i+1])

  Round 2 (k=4): rank by first 4 characters
  ┌───┬───┬───┬───┐
  │ b │ a │ n │ a │ n  a  $
  └───┴───┴───┴───┘
  key = (rank[i], rank[i+2])  <- reuses round 1 ranks!

  Round 3 (k=8 > n): all suffixes fully compared. DONE.

  Each round: one sort of n pairs -> O(n log n)
  Total rounds: O(log n)  ->  O(n log² n) overall

What the Code Won't Teach You

$O (1)$ LCP queries require a sparse table. The LCP between sa[i] and sa[j] is $min (lcp [i + 1], \dots, lcp [j])$ -- a range minimum query. Without a sparse table (or segment tree), each LCP query costs $O (n)$ , making problems that need many pairwise LCP comparisons infeasible. The code builds SA + LCP but not the RMQ structure -- you need to know to add it yourself.

Kasai's $h \geq h_{prev} - 1$ property is non-obvious. The key to Kasai's $O (n)$ LCP construction: if suffix $i$ has LCP $h$ with its predecessor in sorted order, then suffix $i + 1$ has LCP $\geq h - 1$ . This holds because removing the first character from both suffixes reduces their LCP by exactly 1. Without this insight, Kasai's algorithm appears to be $O (n^{2})$ (two nested loops), but the amortized argument via $h$ only decreasing by 1 per step makes it linear.

The sentinel character is load-bearing, not cosmetic. Appending '$' ensures no suffix is a prefix of another. Without it, the ordering of "an" vs "ana" is ambiguous (depending on how you break ties for unequal-length strings), and the prefix-doubling construction produces wrong ranks in intermediate rounds. Every formula involving the LCP array assumes the sentinel exists -- removing it silently corrupts the distinct-substring count and longest-repeated-substring answers.

Think of the suffix array as a sorted phone book for substrings. Every substring of $s$ is a prefix of some suffix. Once suffixes are sorted, binary search finds any pattern in $O (m \log n)$ -- or $O (m + \log n)$ with LCP acceleration. The mental model is: you pre-sort all the "entries" once, then answer arbitrary lookup queries cheaply. This framing instantly tells you when SA is the right tool -- anytime you need to answer many independent substring queries against a fixed text, SA is essentially a precomputed index.

The "two strings" trick is more general than it looks. Concatenating two strings with a separator ( $s_{1} \cdot $ \cdot s_{2}$ ) and building one suffix array is the standard approach for longest common substring. But this extends naturally: concatenate $k$ strings with distinct separators, build one SA + LCP, then sweep the LCP array with a sliding window to find substrings common to at least $t$ of the $k$ strings. This "merge everything into one SA" pattern solves an entire class of multi-string problems that would be painful with other approaches.

SA vs. SAM: know when each wins. The suffix array and suffix automaton solve many of the same problems, but their ergonomics differ. SA is better when you need the sorted order of suffixes (lexicographic queries, $k$ -th substring) or when the problem reduces to range queries on the LCP array. SAM is better for online construction (characters arrive one at a time) and when you need to count occurrences or traverse substring structure as a DAG. For offline problems on a single string, they're roughly interchangeable -- pick whichever you've drilled more.

When to Reach for This

Trigger phrases in problem statements:

"longest repeated substring" $\to$ build SA + LCP, answer is $max (lcp)$ .
"count distinct substrings" $\to$ SA + LCP, answer is $\frac{n (n + 1)}{2} - \sum lcp [i]$ .
"lexicographic order of suffixes / rotations" $\to$ direct SA application.
"LCP queries between arbitrary suffixes" $\to$ SA + LCP + sparse table RMQ.
"longest common substring of two strings" $\to$ concatenate with separator, SA + LCP scan.

Counter-examples -- do NOT use suffix array when:

Single pattern search in a text $\to$ use KMP or Z-function in $O (n + m)$ .
Dictionary of multiple patterns $\to$ use Aho-Corasick.
Only need hash-based substring comparison $\to$ use string hashing.
String is short ( $n \leq 5000$ ) and problem is simple $\to$ brute force or DP may suffice.

C++ STL Reference

Function / Class	Header	What it does	Time Complexity
`sort(first, last)`	`<algorithm>`	Used in each round of SA construction.	$O (n \log n)$
`stable_sort(first, last, cmp)`	`<algorithm>`	Alternative for stable rank assignment.	$O (n \log n)$
`string::substr(pos, len)`	`<string>`	Extract substring (debugging only -- $O (l e n)$ ).	$O (len)$
`lower_bound(first, last, val, cmp)`	`<algorithm>`	Binary search on SA for pattern matching.	$O (\log n)$
`upper_bound(first, last, val, cmp)`	`<algorithm>`	Upper end of pattern match range in SA.	$O (\log n)$
`iota(first, last, val)`	`<numeric>`	Fill SA with $0, 1, \dots, n - 1$ before sorting.	$O (n)$
`string::compare(pos, len, other, opos, olen)`	`<string>`	Lexicographic comparison of substrings.	$O (min (len, olen))$

Implementation (Contest-Ready)

sa[i] vs rank[i] -- They're Inverses

sa[i] = which suffix is at position i in sorted order (sa maps rank -> suffix)
rank[i] = where suffix starting at position i sits in sorted order (rank maps suffix -> rank)
rank[sa[i]] = i and sa[rank[i]] = i -- they're inverse permutations.

Confusing the two is one of the most common suffix array bugs. When you want "the ith suffix in sorted order," use sa[i]. When you want "where does suffix i appear in the sorted order," use rank[i].

The Sentinel Character

Append a character SMALLER than any character in the string (typically $ or \0). This ensures no suffix is a prefix of another suffix, which is required for the suffix array construction to produce a unique, well-defined ordering.

Without the sentinel, suffixes like "ab" and "abc" have ambiguous ordering (is "ab" < "abc"? depends on convention). The sentinel makes "ab $" < " a b c$ " unambiguous.

Version 1: Minimal Contest Template -- SA + LCP

cpp

#include <bits/stdc++.h>
using namespace std;

// O(n log^2 n) suffix array construction
vector<int> build_sa(const string& s) {
    int n = s.size();
    vector<int> sa(n), rank_(n), tmp(n);
    iota(sa.begin(), sa.end(), 0);
    for (int i = 0; i < n; i++) rank_[i] = s[i];

    for (int k = 1; k < n; k <<= 1) {
        auto cmp = [&](int a, int b) {
            if (rank_[a] != rank_[b]) return rank_[a] < rank_[b];
            int ra = a + k < n ? rank_[a + k] : -1;
            int rb = b + k < n ? rank_[b + k] : -1;
            return ra < rb;
        };
        sort(sa.begin(), sa.end(), cmp);
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; i++)
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        rank_ = tmp;
        if (rank_[sa[n - 1]] == n - 1) break;
    }
    return sa;
}

// O(n) Kasai's algorithm for LCP array
vector<int> build_lcp(const string& s, const vector<int>& sa) {
    int n = s.size();
    vector<int> rank_(n), lcp(n, 0);
    for (int i = 0; i < n; i++) rank_[sa[i]] = i;

    int h = 0;
    for (int i = 0; i < n; i++) {
        if (rank_[i] > 0) {
            int j = sa[rank_[i] - 1];
            while (i + h < n && j + h < n && s[i + h] == s[j + h]) h++;
            lcp[rank_[i]] = h;
            if (h > 0) h--;
        }
    }
    return lcp;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    string s;
    cin >> s;
    s += '$';
    vector<int> sa = build_sa(s);
    vector<int> lcp = build_lcp(s, sa);

    int n = s.size();
    long long distinct = (long long)n * (n - 1) / 2;
    for (int i = 1; i < n; i++) distinct -= lcp[i];
    cout << distinct << "\n";
}

Version 2: Explained Version -- SA + LCP with Comments

cpp

#include <bits/stdc++.h>
using namespace std;

// Build suffix array in O(n log^2 n) using prefix-doubling.
// Each round sorts suffixes by their first 2^k characters,
// using the ranking from the previous round as sort keys.
vector<int> build_sa(const string& s) {
    int n = s.size();
    vector<int> sa(n), rank_(n), tmp(n);

    // Initialize: sa[i] = i, rank by single character
    iota(sa.begin(), sa.end(), 0);
    for (int i = 0; i < n; i++) rank_[i] = s[i];

    for (int k = 1; k < n; k <<= 1) {
        // Compare suffix i vs suffix j by the pair:
        //   (rank[i], rank[i+k]) vs (rank[j], rank[j+k])
        // This effectively compares the first 2k characters.
        auto cmp = [&](int a, int b) {
            if (rank_[a] != rank_[b]) return rank_[a] < rank_[b];
            // If i+k is out of bounds, that suffix is shorter -> rank = -1
            int ra = a + k < n ? rank_[a + k] : -1;
            int rb = b + k < n ? rank_[b + k] : -1;
            return ra < rb;
        };
        sort(sa.begin(), sa.end(), cmp);

        // Re-assign ranks based on new sorted order
        // Equal pairs get the same rank
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; i++)
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        rank_ = tmp;

        // Early termination: all ranks are unique
        if (rank_[sa[n - 1]] == n - 1) break;
    }
    return sa;
}

// Kasai's algorithm: compute LCP array in O(n).
// Key insight: if suffix at position i has lcp[rank[i]] = h with
// its predecessor in SA order, then the suffix at position i+1
// has lcp >= h-1 with its predecessor.
// We iterate in text order (i = 0..n-1) and maintain h.
vector<int> build_lcp(const string& s, const vector<int>& sa) {
    int n = s.size();
    vector<int> rank_(n), lcp(n, 0);

    // Build rank (inverse of sa): rank[sa[i]] = i
    for (int i = 0; i < n; i++) rank_[sa[i]] = i;

    int h = 0; // current LCP length, can only decrease by 1 each step
    for (int i = 0; i < n; i++) {
        if (rank_[i] > 0) {
            // j = suffix that comes just before suffix i in sorted order
            int j = sa[rank_[i] - 1];
            // Extend match from current h
            while (i + h < n && j + h < n && s[i + h] == s[j + h]) h++;
            lcp[rank_[i]] = h;
            // Next iteration: h decreases by at most 1
            if (h > 0) h--;
        }
    }
    return lcp;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    string s;
    cin >> s;
    // Append sentinel: must be < all other characters in s.
    // This ensures no suffix is a prefix of another suffix,
    // which simplifies comparisons.
    s += '$';

    vector<int> sa = build_sa(s);
    vector<int> lcp = build_lcp(s, sa);

    int n = s.size();

    // Number of distinct substrings:
    // Total substrings = n*(n-1)/2 (excluding sentinel suffixes)
    // Each lcp[i] accounts for lcp[i] substrings already counted
    long long distinct = (long long)n * (n - 1) / 2;
    for (int i = 1; i < n; i++) distinct -= lcp[i];
    cout << distinct << "\n";
}

Operations Reference

The suffix array is a build-once structure -- construction dominates the cost, but every query afterward is fast. This table shows why the upfront investment pays off.

Operation	Time	Space
Build suffix array ( $O (n \log^{2} n)$ version)	$O (n \log^{2} n)$	$O (n)$
Build suffix array ( $O (n \log n)$ with radix sort)	$O (n \log n)$	$O (n)$
Build LCP array (Kasai)	$O (n)$	$O (n)$
Count distinct substrings	$O (n)$ after LCP	$O (1)$ extra
Longest repeated substring	$O (n)$ after LCP	$O (1)$ extra
Pattern search (single pattern of length $m$ )	$O (m \log n)$	$O (1)$ extra
LCP of arbitrary pair $(sa [i], sa [j])$ via RMQ	$O (1)$ per query	$O (n \log n)$ sparse table
Longest common substring (two strings)	$O ((n + m) \log^{2} (n + m))$ build + $O (n + m)$ scan	$O (n + m)$

Problem Patterns & Tricks

Pattern 1: Number of Distinct Substrings

The total number of substrings is $\frac{n (n + 1)}{2}$ . Each $lcp [i]$ represents shared prefixes between consecutive sorted suffixes -- substrings already counted. Subtract them all.

distinct = \frac{n_{orig} (n_{orig} + 1)}{2} - \sum_{i = 1}^{n - 1} lcp [i]

where $n_{orig}$ is the original string length (without sentinel) and $n = n_{orig} + 1$ . Equivalently with sentinel-inclusive $n$ : $\frac{n (n - 1)}{2} - \sum lcp [i]$ .

cpp

long long ans = 0;
for (int i = 1; i < n; i++)
    ans += (sa[i] == 0 ? n : n - sa[i]) - lcp[i];
// each suffix sa[i] contributes (n - sa[i]) substrings minus lcp[i] duplicates

CF examples: CF 802I -- Fake News (hard), SPOJ DISUBSTR.

Pattern 2: Longest Repeated Substring

The longest substring that appears at least twice is the maximum value in the LCP array. To find it, scan $lcp []$ for the max, then extract the substring starting at $sa [argmax]$ with that length.

cpp

int best = *max_element(lcp.begin() + 1, lcp.end());
// The repeated substring is s.substr(sa[argmax], best)

For "longest substring appearing at least $k$ times," slide a window of size $k - 1$ over the LCP array and take the minimum in each window. The answer is the maximum of those minima (use a monotonic queue for $O (n)$ ).

CF examples: CF 1043G -- Speckled Band, SPOJ REPEATS.

Pattern 3: Pattern Matching via Binary Search on SA

All occurrences of pattern $p$ in $s$ form a contiguous range in the suffix array (since SA is sorted). Binary search for the lower and upper bounds of that range. Each comparison costs $O (m)$ , total $O (m \log n)$ .

cpp

// Find range [lo, hi) in sa where s[sa[i]..] starts with pattern p
auto cmp_lo = [&](int sa_i, const string& p) {
    return s.compare(sa_i, p.size(), p) < 0;
};
auto cmp_hi = [&](const string& p, int sa_i) {
    return s.compare(sa_i, p.size(), p) > 0;
};
int lo = lower_bound(sa.begin(), sa.end(), p, cmp_lo) - sa.begin();
int hi = upper_bound(sa.begin(), sa.end(), p, cmp_hi) - sa.begin();
int count = hi - lo; // number of occurrences

CF examples: CF 149E -- Martian Strings.

Pattern 4: Longest Common Substring of Two Strings

Concatenate $s_{1} + ’#’ + s_{2} + ’$’$ (using separators not in either string). Build SA and LCP on the concatenated string. The answer is the maximum $lcp [i]$ where $sa [i]$ and $sa [i - 1]$ come from different original strings.

cpp

string t = s1 + '#' + s2 + '$';
auto sa = build_sa(t);
auto lcp = build_lcp(t, sa);
int sep = s1.size(); // index of '#'
int ans = 0;
for (int i = 1; i < (int)t.size(); i++) {
    bool from1_prev = sa[i - 1] < sep;
    bool from1_curr = sa[i] < sep;
    if (from1_prev != from1_curr)
        ans = max(ans, lcp[i]);
}

CF examples: CF 1526E -- Oolimry, SPOJ LCS.

  Longest common substring via concatenation + SA + LCP:

  s1 = "abcde"    s2 = "bcdef"

  Concatenated: "abcde#bcdef$"
                 ├─s1──┤ ├─s2──┤
                 0    4 5 6   10 11

  After SA + LCP construction, scan adjacent pairs:
  
  sa[i-1]  sa[i]   From    LCP
  ──────── ──────── ─────── ────
  ...
  1        6        s1, s2  4     <- "bcde" in both!
  ...

  Answer: max LCP where sa[i-1] and sa[i] come from
  different strings = 4 -> "bcde"

  Separator '#' ensures no false cross-boundary matches.

Pattern 5: SA + Sparse Table RMQ for Arbitrary LCP Queries

To find the LCP of any two suffixes $sa [i]$ and $sa [j]$ (not necessarily adjacent), compute $min (lcp [i + 1], \dots, lcp [j])$ . Build a sparse table on the LCP array for $O (1)$ range-minimum queries after $O (n \log n)$ preprocessing.

cpp

// Sparse table over lcp[] for O(1) RMQ
vector<vector<int>> sp(LOG, vector<int>(n));
sp[0] = lcp;
for (int j = 1; j < LOG; j++)
    for (int i = 0; i + (1 << j) <= n; i++)
        sp[j][i] = min(sp[j-1][i], sp[j-1][i + (1 << (j-1))]);

auto query = [&](int l, int r) { // LCP of sa[l] and sa[r], l < r
    l++; // lcp is between consecutive, so range is [l+1..r]
    int k = __lg(r - l + 1);
    return min(sp[k][l], sp[k][r - (1 << k) + 1]);
};

CF examples: CF 873F -- Forbidden Indices.

Pattern 6: Counting Substrings with Property via SA + Binary Search

Many problems ask "how many substrings satisfy X?" where X depends on length or lexicographic order. Group suffixes by LCP ranges. Binary search on the LCP array to find how far a common prefix extends. Often combined with monotonic stack on LCP values.

CF examples: CF 432D -- Prefixes and Suffixes, CF 427D -- Match & Catch.

Contest Cheat Sheet

+--------------------------------------------------------------+
|                  SUFFIX ARRAY CHEAT SHEET                     |
+--------------------------------------------------------------+
| WHEN TO USE:                                                  |
|   - Count distinct substrings        --> SA + LCP sum         |
|   - Longest repeated substring       --> max(lcp[])           |
|   - Pattern matching (multi-query)   --> binary search on SA  |
|   - Longest common substring (2 str) --> concat + SA + LCP    |
|   - Any "all substrings" query       --> SA + LCP + RMQ       |
+--------------------------------------------------------------+
| CONSTRUCTION:                                                 |
|   s += '$';           // sentinel, smaller than all chars     |
|   sa = build_sa(s);   // O(n log^2 n) or O(n log n)          |
|   lcp = build_lcp(s, sa);  // Kasai, O(n)                    |
+--------------------------------------------------------------+
| KEY FORMULAS:                                                 |
|   distinct substrings = n*(n-1)/2 - sum(lcp[1..n-1])         |
|   longest repeated    = max(lcp[1..n-1])                      |
|   occurrences of P    = upper_bound - lower_bound on SA       |
+--------------------------------------------------------------+
| COMPLEXITY:                                                   |
|   Build SA:   O(n log^2 n) time,  O(n) space                 |
|   Build LCP:  O(n) time,          O(n) space                  |
|   Pattern:    O(m log n) per query                            |
|   RMQ on LCP: O(n log n) build,   O(1) per query             |
+--------------------------------------------------------------+
| WATCH OUT:                                                    |
|   - Append sentinel '$' before building                       |
|   - lcp[0] is undefined (no predecessor)                      |
|   - rank[] is INVERSE of sa[]: rank[sa[i]] = i               |
+--------------------------------------------------------------+

Gotchas & Debugging

Forgetting the Sentinel

Without a sentinel character smaller than all characters in $s$ , suffix comparisons break when one suffix is a prefix of another. Always append '$' (or character 0) and adjust your index arithmetic accordingly.

Off-by-One in LCP

lcp[i] is the LCP between sa[i-1] and sa[i]. There is no lcp[0] (no predecessor). Loops should start at i = 1.

Rank vs SA Confusion

sa[i] = starting index of the $i$ -th suffix in sorted order.
rank[i] = position of suffix starting at index $i$ in the sorted order.
They are inverses: rank[sa[i]] = i and sa[rank[i]] = i.

Mixing them up causes silent wrong answers. Print both arrays during debugging.

Sentinel in Distinct Substring Count

If you appended '$', the sentinel contributes extra suffixes. The formula $\frac{n (n + 1)}{2} - \sum lcp$ counts substrings including those containing '$'. Adjust: use original string length, or subtract suffixes containing the sentinel.

Integer Overflow

For strings of length $n \leq 10^{5}$ , the number of distinct substrings can reach $\sim 5 \times 10^{9}$ , exceeding int. Use long long.

Debugging Tips

Print the sorted suffixes alongside the SA and LCP arrays for small inputs.
Verify rank[sa[i]] == i for all $i$ .
Verify lcp[i] >= 0 and lcp[i] < n for all valid $i$ .
Test on "aaaa" (all same characters) and "abcde" (all distinct) as corner cases.

Mental Traps

Trap 1: "Suffix array is just sorting suffixes -- why is it hard?"

Sorting $n$ suffixes naively costs $O (n^{2} \log n)$ because each comparison is $O (n)$ . The "hard" part is prefix doubling: using previous rounds' ranks to make comparisons $O (1)$ , reducing total work to $O (n \log^{2} n)$ or $O (n \log n)$ with radix sort. If you think "just sort," you'll TLE on $n = 10^{5}$ .

  Naive sort vs. prefix-doubling:

  Naive:
  ┌─────────────────────────────────────┐
  │  n suffixes x O(n) per comparison   │
  │  x O(n log n) comparisons           │
  │  = O(n² log n)                      │  <- TLE for n >= 10⁴
  └─────────────────────────────────────┘

  Prefix-doubling:
  ┌─────────────────────────────────────┐
  │  O(log n) rounds                    │
  │  x O(n log n) sort per round        │
  │  x O(1) per comparison (rank pairs) │
  │  = O(n log² n)                      │  <- OK for n <= 10⁵
  └─────────────────────────────────────┘

Trap 2: "I'll compute LCP on-the-fly instead of building the array."

For a single pair, computing LCP by character comparison is fine: $O (lcp)$ time. But for "longest repeated substring" you need $max (lcp [i])$ across the entire array, and for "distinct substrings" you need $\sum lcp [i]$ . Computing each LCP from scratch is $O (n)$ per pair -> $O (n^{2})$ total. Kasai's algorithm gives the full array in $O (n)$ via the $h \geq h_{prev} - 1$ property.

Common Mistakes

Kasai iteration order. "My LCP array had garbage values for some positions, and distinct-substring count was way off..." In Kasai's algorithm, iterating in suffix-array order (for i in sa[0], sa[1], ...) instead of text order (for i in 0, 1, ..., n-1) breaks the amortization. Kasai's algorithm exploits the fact that lcp[rank[i]] >= lcp[rank[i-1]] - 1 -- this inequality only holds when iterating by text position. Iterating in SA order produces wrong values.

Debug Drills

Drill 1 -- Kasai's algorithm: wrong iteration order

cpp

int k = 0;
for (int i = 0; i < n; i++) {
    int j = sa[rank[i] - 1];  // BUG: no bounds check
    while (s[i + k] == s[j + k]) k++;
    lcp[rank[i]] = k;
    if (k) k--;
}

What's wrong?

When rank[i] == 0, rank[i] - 1 = -1 is out of bounds. The first suffix in sorted order has no predecessor, so its LCP is 0. Add a guard:

cpp

if (rank[i] == 0) { k = 0; continue; }

Drill 2 -- Suffix array sort: not breaking ties correctly

cpp

// Doubling step: sort by (rank[i], rank[i + w])
sort(sa.begin(), sa.end(), [&](int a, int b) {
    return rank[a] < rank[b] || 
           (rank[a] == rank[b] && rank[a + w] < rank[b + w]);  // BUG
});

What's wrong?

When a + w >= n or b + w >= n, rank[a + w] is out of bounds. Suffixes that are shorter than w remaining characters should be treated as having rank -1 (or INT_MIN):

cpp

auto second = [&](int x) { return x + w < n ? rank[x + w] : -1; };
sort(sa.begin(), sa.end(), [&](int a, int b) {
    return rank[a] < rank[b] || (rank[a] == rank[b] && second(a) < second(b));
});

Drill 3 -- Distinct substring count: off-by-one with LCP sum

cpp

long long total = (long long)n * (n + 1) / 2;
for (int i = 0; i < n; i++)  // BUG: LCP[0] is usually 0 or undefined
    total -= lcp[i];

What's wrong?

Depending on convention, lcp[0] might be undefined (it's the LCP between the "suffix before the first" and the first -- which doesn't exist). Typically lcp[0] = 0 by definition, so this may or may not be a bug. But if your LCP array is 1-indexed (common in some templates), you should iterate from i = 1. Verify your template's convention.

Self-Test

Before moving to problems, verify you can do these without looking at the notes:

[ ] Implement $O (n \log^{2} n)$ SA construction from memory: initialize ranks from characters, iterate doubling rounds, sort by (rank[i], rank[i+k]) pairs, re-rank, early-terminate when all ranks are distinct
[ ] Implement Kasai's LCP algorithm from memory, including the inverse-SA computation and the h-- step
[ ] State the inverse relationship rank[sa[i]] = i and verify on "banana$" -- explain what bug arises from confusing sa and rank
[ ] Compute distinct substrings from SA + LCP: $\frac{n (n + 1)}{2} - \sum lcp [i]$ , and explain why lcp[0] is excluded
[ ] Describe how to find all occurrences of pattern $P$ in string $S$ via binary search on the suffix array, and state the per-query complexity

Practice Problems

CF Rating	What You Should Be Able to Do
1200	Understand what a suffix array is conceptually; sort suffixes by brute force.
1500	Implement $O (n \log^{2} n)$ suffix array construction; compute the LCP array using Kasai's algorithm.
1800	Use SA + LCP + RMQ for pattern search, counting distinct substrings, and longest common substring of two strings.
2100	Combine SA + LCP with monotonic stack, binary search, or segment tree for advanced problems (e.g., sum of distinct substring lengths, LCP-based grouping).

#	Problem	Source	Difficulty	Key Idea
1	Distinct Substrings	SPOJ	1400	SA + LCP, count distinct substrings
2	Longest Common Substring	SPOJ	1500	Concatenate two strings, SA + LCP scan
3	String Matching	CSES	1500	Binary search on SA for pattern
4	Prefixes and Suffixes	CF 432D	1900	SA + LCP + counting
5	Match & Catch	CF 427D	2000	SA + LCP on concatenated strings
6	Forbidden Indices	CF 873F	2100	SA + LCP + sparse table RMQ
7	Martian Strings	CF 149E	2300	SA + LCP, substring search
8	Repeats	SPOJ	2200	SA + LCP + binary search on length
9	Speckled Band	CF 1043G	2400	SA + LCP, heavy case analysis
10	Fake News (hard)	CF 802I	2300	SA + LCP + monotonic stack

Suffix Array ​

Contents ​

Intuition ​

Visual Walkthrough ​

Past the construction: what the LCP array buys you ​

Step-by-Step Trace: Kasai's LCP Algorithm for "banana$" ​

The Invariant ​

What the Code Won't Teach You ​

When to Reach for This ​

C++ STL Reference ​

Implementation (Contest-Ready) ​

sa[i] vs rank[i] -- They're Inverses ​

The Sentinel Character ​

Version 1: Minimal Contest Template -- SA + LCP ​

Version 2: Explained Version -- SA + LCP with Comments ​

Operations Reference ​

Problem Patterns & Tricks ​

Pattern 1: Number of Distinct Substrings ​

Pattern 2: Longest Repeated Substring ​

Pattern 3: Pattern Matching via Binary Search on SA ​

Pattern 4: Longest Common Substring of Two Strings ​

Pattern 5: SA + Sparse Table RMQ for Arbitrary LCP Queries ​

Pattern 6: Counting Substrings with Property via SA + Binary Search ​

Contest Cheat Sheet ​

Gotchas & Debugging ​

Forgetting the Sentinel ​

Off-by-One in LCP ​

Rank vs SA Confusion ​

Sentinel in Distinct Substring Count ​

Integer Overflow ​

Debugging Tips ​

Mental Traps ​

Common Mistakes ​

Debug Drills ​

Self-Test ​

Practice Problems ​

Further Reading ​

Suffix Array

Contents

Intuition

Visual Walkthrough

Past the construction: what the LCP array buys you

Step-by-Step Trace: Kasai's LCP Algorithm for `"banana$"`

The Invariant

What the Code Won't Teach You

When to Reach for This

C++ STL Reference

Implementation (Contest-Ready)

sa[i] vs rank[i] -- They're Inverses

The Sentinel Character

Version 1: Minimal Contest Template -- SA + LCP

Version 2: Explained Version -- SA + LCP with Comments

Operations Reference

Problem Patterns & Tricks

Pattern 1: Number of Distinct Substrings

Pattern 2: Longest Repeated Substring

Pattern 3: Pattern Matching via Binary Search on SA

Pattern 4: Longest Common Substring of Two Strings

Pattern 5: SA + Sparse Table RMQ for Arbitrary LCP Queries

Pattern 6: Counting Substrings with Property via SA + Binary Search

Contest Cheat Sheet

Gotchas & Debugging

Forgetting the Sentinel

Off-by-One in LCP

Rank vs SA Confusion

Sentinel in Distinct Substring Count

Integer Overflow

Debugging Tips

Mental Traps

Common Mistakes

Debug Drills

Self-Test

Practice Problems

Further Reading