Skip to content

Suffix Array

Quick Revisit

  • USE WHEN: Distinct substrings, longest repeated substring, pattern matching over all suffixes
  • INVARIANT: sa[] is a permutation of [0..n-1] with suffix sa[i] < suffix sa[i+1] lexicographically; LCP array stores shared prefix lengths of adjacent sorted suffixes
  • TIME: O(n log n) build, O(m log n) pattern search | SPACE: O(n)
  • CLASSIC BUG: Missing sentinel character ($ < all others) so the last suffix sorts correctly
  • PRACTICE: 07-ladder-strings

Sort every suffix of a string, keep only the starting indices. This O(nlogn) structure answers substring queries -- distinct substrings, longest repeated substring, pattern matching -- that would otherwise require a suffix tree.

Difficulty: [Advanced]Prerequisites: KMP and Z-Function, Sorting and SearchingCF Rating Relevance: 1900-2200+

Contest Frequency: * Specialized -- appears in advanced string problems


Contents

Intuition

Take the string "banana". You want to find every occurrence of a pattern, or the longest repeated substring. The brute-force plan: enumerate all n suffixes, sort them lexicographically, then scan for answers.

The problem? Comparing two suffixes costs O(n) in the worst case, and sorting n items with comparison sort is O(nlogn) comparisons, so naive construction is O(n2logn). For n=105 that is roughly 1011.7 character comparisons -- way too slow.

We need a way to sort suffixes where each comparison is O(1), not O(n).

Sort suffixes by doubling -- first by 1 character, then 2, then 4, ... -- using the previous round's ranks to compare in O(1), giving O(nlogn) total.

Here is the idea built from scratch:

  1. Round 0. Rank every suffix by its first character alone. This is just sorting single characters -- O(n) via counting sort, or O(nlogn) via comparison sort.

  2. Round 1. Now sort by the first 2 characters. But we already know the rank of every suffix by its first character. So the first-2-character key for suffix i is the pair (rank[i],rank[i+1]). Comparing two pairs of integers is O(1).

  3. Round 2. Sort by the first 4 characters. The key for suffix i is (rank[i],rank[i+2]) -- the rank of the first-2 block and the rank of the second-2 block (which starts at i+2). Again O(1) per comparison.

  4. General round k. Sort by the first 2k characters using the pair (rank[i],rank[i+2k1]). Each round is one sort of n items with O(1) comparisons.

  5. Termination. After log2n rounds the prefix length exceeds n, so we are sorting by the full suffix. Total: O(logn) rounds × O(nlogn) sort =O(nlog2n), or O(nlogn) if each round uses radix sort on pairs.

Once the suffix array is built, the LCP array (longest common prefix between each pair of adjacent sorted suffixes) solves "longest repeated substring" in O(n) via Kasai's algorithm, which exploits the identity lcp[i]lcp[i1]1 when traversing in text order.

Visual Walkthrough

String: "banana$" (n=7, $ is a sentinel smaller than all letters.)

All suffixes:

  i=0: banana$
  i=1: anana$
  i=2: nana$
  i=3: ana$
  i=4: na$
  i=5: a$
  i=6: $

Round k=0 -- rank by first 1 character:

  char:  b   a   n   a   n   a   $
  rank: [1,  0,  2,  0,  2,  0, -1]
        ($ mapped to -1, a=0, b=1, n=2)

Round k=1 -- rank by first 2 characters, key = (rank[i], rank[i+1]):

  i : pair            meaning
  0 : (1, 0)          "ba"
  1 : (0, 2)          "an"
  2 : (2, 0)          "na"
  3 : (0, 2)          "an"
  4 : (2, 0)          "na"
  5 : (0,-1)          "a$"
  6 : (-1,*)          "$"   (* = out-of-bounds -> -inf)

  Sort pairs -> re-rank:
  rank: [3,  2,  4,  2,  4,  1,  0]
         b   a   n   a   n   a   $

Suffixes 1 and 3 share rank 2 (both start with "an"), and suffixes 2 and 4 share rank 4 (both start with "na"). Ties remain -- we need more rounds.

Round k=2 -- rank by first 4 characters, key = (rank[i], rank[i+2]):

  i : pair            meaning
  0 : (3, 4)          "bana"
  1 : (2, 2)          "anan"
  2 : (4, 4)          "nana"
  3 : (2, 1)          "ana$"
  4 : (4, 0)          "na$"
  5 : (1, *)          "a$"
  6 : (0, *)          "$"

  Sort pairs in ascending order, then re-rank:
  rank: [4,  3,  6,  2,  5,  1,  0]
         b   a   n   a   n   a   $

Every pair above is distinct (the second component breaks the round-1 ties on "an" and "na"), so each suffix gets a unique rank. We can stop -- once all n ranks are distinct, the sort is complete.

Final suffix array (read off sorted order):

  sa[0] = 6   $
  sa[1] = 5   a$
  sa[2] = 3   ana$
  sa[3] = 1   anana$
  sa[4] = 0   banana$
  sa[5] = 4   na$
  sa[6] = 2   nana$

That is the suffix array: a permutation of [0,n) ordering suffixes lexicographically. By itself it already solves "kth smallest substring" and "binary-search a pattern in O(mlogn)." Most of the famous applications, though, need one more array.

Past the construction: what the LCP array buys you

After the suffix array is built, the LCP array records the longest common prefix between each pair of adjacent sorted suffixes. The reason that is enough -- the reason we never need LCPs of arbitrary pairs to solve "longest repeated substring" or "count distinct substrings" -- is a neat structural fact:

For any i<j in sorted order, lcp(sa[i],sa[j])=min(lcp[i+1],,lcp[j]).

So adjacent LCPs subsume non-adjacent ones via range-minimum queries. That is why "distinct substrings" only needs lcp[i] and "longest repeated substring" only needs maxlcp[i].

Kasai's LCP computation (traverse in text order, maintain height h):

  Sorted suffixes:             LCP with predecessor:
  sa[0]=6:  $                  ---
  sa[1]=5:  a$                 lcp[1] = 0
  sa[2]=3:  ana$               lcp[2] = 1   ("a")
  sa[3]=1:  anana$             lcp[3] = 3   ("ana")
  sa[4]=0:  banana$            lcp[4] = 0
  sa[5]=4:  na$                lcp[5] = 0
  sa[6]=2:  nana$              lcp[6] = 2   ("na")

  lcp[] = { -, 0, 1, 3, 0, 0, 2 }

  Longest repeated substring: max(lcp) = 3  -->  "ana"
  Distinct substrings = 7*6/2 - (0+1+3+0+0+2) = 21 - 6 = 15
  (n=7 includes sentinel; formula is n*(n-1)/2 to exclude '$' substrings)

Step-by-Step Trace: Kasai's LCP Algorithm for "banana$"

Kasai traverses in text order (i = 0, 1, ..., 6), not SA order. The variable h carries forward: it decreases by at most 1 between steps.

  SA:   [6, 5, 3, 1, 0, 4, 2]
  rank: [4, 3, 6, 2, 5, 1, 0]     (inverse of SA: rank[sa[j]] = j)

  | Step | i | suffix    | rank[i] | SA predecessor         | h start | Compare              | lcp | h end |
  |------|---|-----------|---------|------------------------|---------|----------------------|-----|-------|
  |  1   | 0 | banana$   |    4    | sa[3]=1 -> anana$       |    0    | 'b' != 'a' -> stop    |  0  |   0   |
  |  2   | 1 | anana$    |    3    | sa[2]=3 -> ana$         |    0    | a=a, n=n, a=a OK      |  3  |   3   |
  |      |   |           |         |                        |         | then 'n' != '$' -> stop|     |       |
  |  3   | 2 | nana$     |    6    | sa[5]=4 -> na$          |  3-1=2  | s[4]='n' != '$' ->stop |  2  |   2   |
  |  4   | 3 | ana$      |    2    | sa[1]=5 -> a$           |  2-1=1  | s[4]='n' != '$' ->stop |  1  |   1   |
  |  5   | 4 | na$       |    5    | sa[4]=0 -> banana$      |  1-1=0  | 'n' != 'b' -> stop    |  0  |   0   |
  |  6   | 5 | a$        |    1    | sa[0]=6 -> $            |    0    | 'a' != '$' -> stop    |  0  |   0   |
  |  7   | 6 | $         |    0    | (rank=0, no pred.)     |    0    | -- skip --             |  --  |   0   |

  Result: lcp[] = [ -, 0, 1, 3, 0, 0, 2 ]

  Key: h dropped 3->2->1->0 (one per step), never jumped down.
  Total extensions across all steps <= 2n, giving Kasai's O(n) bound.

The Invariant

After round k, the rank array correctly orders all suffixes by their first 2k characters. Each round reuses the previous ranks: the sort key for suffix i is the pair (rank[i],rank[i+2k1]), which encodes 2k characters in O(1).

This is why doubling works: two known halves of length 2k1 are combined into a single key of length 2k with no extra character comparisons. The number of distinct ranks can only increase or stay the same each round, and once all n ranks are unique the suffix array is complete.

  Prefix-doubling: how each round doubles the comparison window

  Round 0 (k=1): rank by first 1 character
  ┌───┐
  │ b │ a  n  a  n  a  $
  └───┘
  key = rank[i]

  Round 1 (k=2): rank by first 2 characters
  ┌───┬───┐
  │ b │ a │ n  a  n  a  $
  └───┴───┘
  key = (rank[i], rank[i+1])

  Round 2 (k=4): rank by first 4 characters
  ┌───┬───┬───┬───┐
  │ b │ a │ n │ a │ n  a  $
  └───┴───┴───┴───┘
  key = (rank[i], rank[i+2])  <- reuses round 1 ranks!

  Round 3 (k=8 > n): all suffixes fully compared. DONE.

  Each round: one sort of n pairs -> O(n log n)
  Total rounds: O(log n)  ->  O(n log² n) overall

What the Code Won't Teach You

O(1) LCP queries require a sparse table. The LCP between sa[i] and sa[j] is min(lcp[i+1],,lcp[j]) -- a range minimum query. Without a sparse table (or segment tree), each LCP query costs O(n), making problems that need many pairwise LCP comparisons infeasible. The code builds SA + LCP but not the RMQ structure -- you need to know to add it yourself.

Kasai's hhprev1 property is non-obvious. The key to Kasai's O(n) LCP construction: if suffix i has LCP h with its predecessor in sorted order, then suffix i+1 has LCP h1. This holds because removing the first character from both suffixes reduces their LCP by exactly 1. Without this insight, Kasai's algorithm appears to be O(n2) (two nested loops), but the amortized argument via h only decreasing by 1 per step makes it linear.

The sentinel character is load-bearing, not cosmetic. Appending '$' ensures no suffix is a prefix of another. Without it, the ordering of "an" vs "ana" is ambiguous (depending on how you break ties for unequal-length strings), and the prefix-doubling construction produces wrong ranks in intermediate rounds. Every formula involving the LCP array assumes the sentinel exists -- removing it silently corrupts the distinct-substring count and longest-repeated-substring answers.

Think of the suffix array as a sorted phone book for substrings. Every substring of s is a prefix of some suffix. Once suffixes are sorted, binary search finds any pattern in O(mlogn) -- or O(m+logn) with LCP acceleration. The mental model is: you pre-sort all the "entries" once, then answer arbitrary lookup queries cheaply. This framing instantly tells you when SA is the right tool -- anytime you need to answer many independent substring queries against a fixed text, SA is essentially a precomputed index.

The "two strings" trick is more general than it looks. Concatenating two strings with a separator (s1$s2) and building one suffix array is the standard approach for longest common substring. But this extends naturally: concatenate k strings with distinct separators, build one SA + LCP, then sweep the LCP array with a sliding window to find substrings common to at least t of the k strings. This "merge everything into one SA" pattern solves an entire class of multi-string problems that would be painful with other approaches.

SA vs. SAM: know when each wins. The suffix array and suffix automaton solve many of the same problems, but their ergonomics differ. SA is better when you need the sorted order of suffixes (lexicographic queries, k-th substring) or when the problem reduces to range queries on the LCP array. SAM is better for online construction (characters arrive one at a time) and when you need to count occurrences or traverse substring structure as a DAG. For offline problems on a single string, they're roughly interchangeable -- pick whichever you've drilled more.

When to Reach for This

Trigger phrases in problem statements:

  • "longest repeated substring" build SA + LCP, answer is max(lcp).
  • "count distinct substrings" SA + LCP, answer is n(n+1)2lcp[i].
  • "lexicographic order of suffixes / rotations" direct SA application.
  • "LCP queries between arbitrary suffixes" SA + LCP + sparse table RMQ.
  • "longest common substring of two strings" concatenate with separator, SA + LCP scan.

Counter-examples -- do NOT use suffix array when:

  • Single pattern search in a text use KMP or Z-function in O(n+m).
  • Dictionary of multiple patterns use Aho-Corasick.
  • Only need hash-based substring comparison use string hashing.
  • String is short (n5000) and problem is simple brute force or DP may suffice.

C++ STL Reference

Function / ClassHeaderWhat it doesTime Complexity
sort(first, last)<algorithm>Used in each round of SA construction.O(nlogn)
stable_sort(first, last, cmp)<algorithm>Alternative for stable rank assignment.O(nlogn)
string::substr(pos, len)<string>Extract substring (debugging only -- O(len)).O(len)
lower_bound(first, last, val, cmp)<algorithm>Binary search on SA for pattern matching.O(logn)
upper_bound(first, last, val, cmp)<algorithm>Upper end of pattern match range in SA.O(logn)
iota(first, last, val)<numeric>Fill SA with 0,1,,n1 before sorting.O(n)
string::compare(pos, len, other, opos, olen)<string>Lexicographic comparison of substrings.O(min(len,olen))

Implementation (Contest-Ready)

sa[i] vs rank[i] -- They're Inverses

  • sa[i] = which suffix is at position i in sorted order (sa maps rank -> suffix)
  • rank[i] = where suffix starting at position i sits in sorted order (rank maps suffix -> rank)
  • rank[sa[i]] = i and sa[rank[i]] = i -- they're inverse permutations.

Confusing the two is one of the most common suffix array bugs. When you want "the ith suffix in sorted order," use sa[i]. When you want "where does suffix i appear in the sorted order," use rank[i].

The Sentinel Character

Append a character SMALLER than any character in the string (typically $ or \0). This ensures no suffix is a prefix of another suffix, which is required for the suffix array construction to produce a unique, well-defined ordering.

Without the sentinel, suffixes like "ab" and "abc" have ambiguous ordering (is "ab" < "abc"? depends on convention). The sentinel makes "ab"<"abc" unambiguous.

Version 1: Minimal Contest Template -- SA + LCP

cpp
#include <bits/stdc++.h>
using namespace std;

// O(n log^2 n) suffix array construction
vector<int> build_sa(const string& s) {
    int n = s.size();
    vector<int> sa(n), rank_(n), tmp(n);
    iota(sa.begin(), sa.end(), 0);
    for (int i = 0; i < n; i++) rank_[i] = s[i];

    for (int k = 1; k < n; k <<= 1) {
        auto cmp = [&](int a, int b) {
            if (rank_[a] != rank_[b]) return rank_[a] < rank_[b];
            int ra = a + k < n ? rank_[a + k] : -1;
            int rb = b + k < n ? rank_[b + k] : -1;
            return ra < rb;
        };
        sort(sa.begin(), sa.end(), cmp);
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; i++)
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        rank_ = tmp;
        if (rank_[sa[n - 1]] == n - 1) break;
    }
    return sa;
}

// O(n) Kasai's algorithm for LCP array
vector<int> build_lcp(const string& s, const vector<int>& sa) {
    int n = s.size();
    vector<int> rank_(n), lcp(n, 0);
    for (int i = 0; i < n; i++) rank_[sa[i]] = i;

    int h = 0;
    for (int i = 0; i < n; i++) {
        if (rank_[i] > 0) {
            int j = sa[rank_[i] - 1];
            while (i + h < n && j + h < n && s[i + h] == s[j + h]) h++;
            lcp[rank_[i]] = h;
            if (h > 0) h--;
        }
    }
    return lcp;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    string s;
    cin >> s;
    s += '$';
    vector<int> sa = build_sa(s);
    vector<int> lcp = build_lcp(s, sa);

    int n = s.size();
    long long distinct = (long long)n * (n - 1) / 2;
    for (int i = 1; i < n; i++) distinct -= lcp[i];
    cout << distinct << "\n";
}

Version 2: Explained Version -- SA + LCP with Comments

cpp
#include <bits/stdc++.h>
using namespace std;

// Build suffix array in O(n log^2 n) using prefix-doubling.
// Each round sorts suffixes by their first 2^k characters,
// using the ranking from the previous round as sort keys.
vector<int> build_sa(const string& s) {
    int n = s.size();
    vector<int> sa(n), rank_(n), tmp(n);

    // Initialize: sa[i] = i, rank by single character
    iota(sa.begin(), sa.end(), 0);
    for (int i = 0; i < n; i++) rank_[i] = s[i];

    for (int k = 1; k < n; k <<= 1) {
        // Compare suffix i vs suffix j by the pair:
        //   (rank[i], rank[i+k]) vs (rank[j], rank[j+k])
        // This effectively compares the first 2k characters.
        auto cmp = [&](int a, int b) {
            if (rank_[a] != rank_[b]) return rank_[a] < rank_[b];
            // If i+k is out of bounds, that suffix is shorter -> rank = -1
            int ra = a + k < n ? rank_[a + k] : -1;
            int rb = b + k < n ? rank_[b + k] : -1;
            return ra < rb;
        };
        sort(sa.begin(), sa.end(), cmp);

        // Re-assign ranks based on new sorted order
        // Equal pairs get the same rank
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; i++)
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        rank_ = tmp;

        // Early termination: all ranks are unique
        if (rank_[sa[n - 1]] == n - 1) break;
    }
    return sa;
}

// Kasai's algorithm: compute LCP array in O(n).
// Key insight: if suffix at position i has lcp[rank[i]] = h with
// its predecessor in SA order, then the suffix at position i+1
// has lcp >= h-1 with its predecessor.
// We iterate in text order (i = 0..n-1) and maintain h.
vector<int> build_lcp(const string& s, const vector<int>& sa) {
    int n = s.size();
    vector<int> rank_(n), lcp(n, 0);

    // Build rank (inverse of sa): rank[sa[i]] = i
    for (int i = 0; i < n; i++) rank_[sa[i]] = i;

    int h = 0; // current LCP length, can only decrease by 1 each step
    for (int i = 0; i < n; i++) {
        if (rank_[i] > 0) {
            // j = suffix that comes just before suffix i in sorted order
            int j = sa[rank_[i] - 1];
            // Extend match from current h
            while (i + h < n && j + h < n && s[i + h] == s[j + h]) h++;
            lcp[rank_[i]] = h;
            // Next iteration: h decreases by at most 1
            if (h > 0) h--;
        }
    }
    return lcp;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    string s;
    cin >> s;
    // Append sentinel: must be < all other characters in s.
    // This ensures no suffix is a prefix of another suffix,
    // which simplifies comparisons.
    s += '$';

    vector<int> sa = build_sa(s);
    vector<int> lcp = build_lcp(s, sa);

    int n = s.size();

    // Number of distinct substrings:
    // Total substrings = n*(n-1)/2 (excluding sentinel suffixes)
    // Each lcp[i] accounts for lcp[i] substrings already counted
    long long distinct = (long long)n * (n - 1) / 2;
    for (int i = 1; i < n; i++) distinct -= lcp[i];
    cout << distinct << "\n";
}

Operations Reference

The suffix array is a build-once structure -- construction dominates the cost, but every query afterward is fast. This table shows why the upfront investment pays off.

OperationTimeSpace
Build suffix array (O(nlog2n) version)O(nlog2n)O(n)
Build suffix array (O(nlogn) with radix sort)O(nlogn)O(n)
Build LCP array (Kasai)O(n)O(n)
Count distinct substringsO(n) after LCPO(1) extra
Longest repeated substringO(n) after LCPO(1) extra
Pattern search (single pattern of length m)O(mlogn)O(1) extra
LCP of arbitrary pair (sa[i],sa[j]) via RMQO(1) per queryO(nlogn) sparse table
Longest common substring (two strings)O((n+m)log2(n+m)) build + O(n+m) scanO(n+m)

Problem Patterns & Tricks

Pattern 1: Number of Distinct Substrings

The total number of substrings is n(n+1)2. Each lcp[i] represents shared prefixes between consecutive sorted suffixes -- substrings already counted. Subtract them all.

distinct=norig(norig+1)2i=1n1lcp[i]

where norig is the original string length (without sentinel) and n=norig+1. Equivalently with sentinel-inclusive n: n(n1)2lcp[i].

cpp
long long ans = 0;
for (int i = 1; i < n; i++)
    ans += (sa[i] == 0 ? n : n - sa[i]) - lcp[i];
// each suffix sa[i] contributes (n - sa[i]) substrings minus lcp[i] duplicates

CF examples: CF 802I -- Fake News (hard), SPOJ DISUBSTR.


Pattern 2: Longest Repeated Substring

The longest substring that appears at least twice is the maximum value in the LCP array. To find it, scan lcp[] for the max, then extract the substring starting at sa[argmax] with that length.

cpp
int best = *max_element(lcp.begin() + 1, lcp.end());
// The repeated substring is s.substr(sa[argmax], best)

For "longest substring appearing at least k times," slide a window of size k1 over the LCP array and take the minimum in each window. The answer is the maximum of those minima (use a monotonic queue for O(n)).

CF examples: CF 1043G -- Speckled Band, SPOJ REPEATS.


Pattern 3: Pattern Matching via Binary Search on SA

All occurrences of pattern p in s form a contiguous range in the suffix array (since SA is sorted). Binary search for the lower and upper bounds of that range. Each comparison costs O(m), total O(mlogn).

cpp
// Find range [lo, hi) in sa where s[sa[i]..] starts with pattern p
auto cmp_lo = [&](int sa_i, const string& p) {
    return s.compare(sa_i, p.size(), p) < 0;
};
auto cmp_hi = [&](const string& p, int sa_i) {
    return s.compare(sa_i, p.size(), p) > 0;
};
int lo = lower_bound(sa.begin(), sa.end(), p, cmp_lo) - sa.begin();
int hi = upper_bound(sa.begin(), sa.end(), p, cmp_hi) - sa.begin();
int count = hi - lo; // number of occurrences

CF examples: CF 149E -- Martian Strings.


Pattern 4: Longest Common Substring of Two Strings

Concatenate s1+’#’+s2+’$’ (using separators not in either string). Build SA and LCP on the concatenated string. The answer is the maximum lcp[i] where sa[i] and sa[i1] come from different original strings.

cpp
string t = s1 + '#' + s2 + '$';
auto sa = build_sa(t);
auto lcp = build_lcp(t, sa);
int sep = s1.size(); // index of '#'
int ans = 0;
for (int i = 1; i < (int)t.size(); i++) {
    bool from1_prev = sa[i - 1] < sep;
    bool from1_curr = sa[i] < sep;
    if (from1_prev != from1_curr)
        ans = max(ans, lcp[i]);
}

CF examples: CF 1526E -- Oolimry, SPOJ LCS.

  Longest common substring via concatenation + SA + LCP:

  s1 = "abcde"    s2 = "bcdef"

  Concatenated: "abcde#bcdef$"
                 ├─s1──┤ ├─s2──┤
                 0    4 5 6   10 11

  After SA + LCP construction, scan adjacent pairs:
  
  sa[i-1]  sa[i]   From    LCP
  ──────── ──────── ─────── ────
  ...
  1        6        s1, s2  4     <- "bcde" in both!
  ...

  Answer: max LCP where sa[i-1] and sa[i] come from
  different strings = 4 -> "bcde"

  Separator '#' ensures no false cross-boundary matches.

Pattern 5: SA + Sparse Table RMQ for Arbitrary LCP Queries

To find the LCP of any two suffixes sa[i] and sa[j] (not necessarily adjacent), compute min(lcp[i+1],,lcp[j]). Build a sparse table on the LCP array for O(1) range-minimum queries after O(nlogn) preprocessing.

cpp
// Sparse table over lcp[] for O(1) RMQ
vector<vector<int>> sp(LOG, vector<int>(n));
sp[0] = lcp;
for (int j = 1; j < LOG; j++)
    for (int i = 0; i + (1 << j) <= n; i++)
        sp[j][i] = min(sp[j-1][i], sp[j-1][i + (1 << (j-1))]);

auto query = [&](int l, int r) { // LCP of sa[l] and sa[r], l < r
    l++; // lcp is between consecutive, so range is [l+1..r]
    int k = __lg(r - l + 1);
    return min(sp[k][l], sp[k][r - (1 << k) + 1]);
};

CF examples: CF 873F -- Forbidden Indices.


Many problems ask "how many substrings satisfy X?" where X depends on length or lexicographic order. Group suffixes by LCP ranges. Binary search on the LCP array to find how far a common prefix extends. Often combined with monotonic stack on LCP values.

CF examples: CF 432D -- Prefixes and Suffixes, CF 427D -- Match & Catch.


Contest Cheat Sheet

+--------------------------------------------------------------+
|                  SUFFIX ARRAY CHEAT SHEET                     |
+--------------------------------------------------------------+
| WHEN TO USE:                                                  |
|   - Count distinct substrings        --> SA + LCP sum         |
|   - Longest repeated substring       --> max(lcp[])           |
|   - Pattern matching (multi-query)   --> binary search on SA  |
|   - Longest common substring (2 str) --> concat + SA + LCP    |
|   - Any "all substrings" query       --> SA + LCP + RMQ       |
+--------------------------------------------------------------+
| CONSTRUCTION:                                                 |
|   s += '$';           // sentinel, smaller than all chars     |
|   sa = build_sa(s);   // O(n log^2 n) or O(n log n)          |
|   lcp = build_lcp(s, sa);  // Kasai, O(n)                    |
+--------------------------------------------------------------+
| KEY FORMULAS:                                                 |
|   distinct substrings = n*(n-1)/2 - sum(lcp[1..n-1])         |
|   longest repeated    = max(lcp[1..n-1])                      |
|   occurrences of P    = upper_bound - lower_bound on SA       |
+--------------------------------------------------------------+
| COMPLEXITY:                                                   |
|   Build SA:   O(n log^2 n) time,  O(n) space                 |
|   Build LCP:  O(n) time,          O(n) space                  |
|   Pattern:    O(m log n) per query                            |
|   RMQ on LCP: O(n log n) build,   O(1) per query             |
+--------------------------------------------------------------+
| WATCH OUT:                                                    |
|   - Append sentinel '$' before building                       |
|   - lcp[0] is undefined (no predecessor)                      |
|   - rank[] is INVERSE of sa[]: rank[sa[i]] = i               |
+--------------------------------------------------------------+

Gotchas & Debugging

Forgetting the Sentinel

Without a sentinel character smaller than all characters in s, suffix comparisons break when one suffix is a prefix of another. Always append '$' (or character 0) and adjust your index arithmetic accordingly.

Off-by-One in LCP

lcp[i] is the LCP between sa[i-1] and sa[i]. There is no lcp[0] (no predecessor). Loops should start at i = 1.

Rank vs SA Confusion

  • sa[i] = starting index of the i-th suffix in sorted order.
  • rank[i] = position of suffix starting at index i in the sorted order.
  • They are inverses: rank[sa[i]] = i and sa[rank[i]] = i.

Mixing them up causes silent wrong answers. Print both arrays during debugging.

Sentinel in Distinct Substring Count

If you appended '$', the sentinel contributes extra suffixes. The formula n(n+1)2lcp counts substrings including those containing '$'. Adjust: use original string length, or subtract suffixes containing the sentinel.

Integer Overflow

For strings of length n105, the number of distinct substrings can reach 5×109, exceeding int. Use long long.

Debugging Tips

  • Print the sorted suffixes alongside the SA and LCP arrays for small inputs.
  • Verify rank[sa[i]] == i for all i.
  • Verify lcp[i] >= 0 and lcp[i] < n for all valid i.
  • Test on "aaaa" (all same characters) and "abcde" (all distinct) as corner cases.

Mental Traps

Trap 1: "Suffix array is just sorting suffixes -- why is it hard?"

Sorting n suffixes naively costs O(n2logn) because each comparison is O(n). The "hard" part is prefix doubling: using previous rounds' ranks to make comparisons O(1), reducing total work to O(nlog2n) or O(nlogn) with radix sort. If you think "just sort," you'll TLE on n=105.

  Naive sort vs. prefix-doubling:

  Naive:
  ┌─────────────────────────────────────┐
  │  n suffixes x O(n) per comparison   │
  │  x O(n log n) comparisons           │
  │  = O(n² log n)                      │  <- TLE for n >= 10⁴
  └─────────────────────────────────────┘

  Prefix-doubling:
  ┌─────────────────────────────────────┐
  │  O(log n) rounds                    │
  │  x O(n log n) sort per round        │
  │  x O(1) per comparison (rank pairs) │
  │  = O(n log² n)                      │  <- OK for n <= 10⁵
  └─────────────────────────────────────┘

Trap 2: "I'll compute LCP on-the-fly instead of building the array."

For a single pair, computing LCP by character comparison is fine: O(lcp) time. But for "longest repeated substring" you need max(lcp[i]) across the entire array, and for "distinct substrings" you need lcp[i]. Computing each LCP from scratch is O(n) per pair -> O(n2) total. Kasai's algorithm gives the full array in O(n) via the hhprev1 property.


Common Mistakes

  1. Kasai iteration order. "My LCP array had garbage values for some positions, and distinct-substring count was way off..." In Kasai's algorithm, iterating in suffix-array order (for i in sa[0], sa[1], ...) instead of text order (for i in 0, 1, ..., n-1) breaks the amortization. Kasai's algorithm exploits the fact that lcp[rank[i]] >= lcp[rank[i-1]] - 1 -- this inequality only holds when iterating by text position. Iterating in SA order produces wrong values.

Debug Drills

Drill 1 -- Kasai's algorithm: wrong iteration order

cpp
int k = 0;
for (int i = 0; i < n; i++) {
    int j = sa[rank[i] - 1];  // BUG: no bounds check
    while (s[i + k] == s[j + k]) k++;
    lcp[rank[i]] = k;
    if (k) k--;
}
What's wrong?

When rank[i] == 0, rank[i] - 1 = -1 is out of bounds. The first suffix in sorted order has no predecessor, so its LCP is 0. Add a guard:

cpp
if (rank[i] == 0) { k = 0; continue; }

Drill 2 -- Suffix array sort: not breaking ties correctly

cpp
// Doubling step: sort by (rank[i], rank[i + w])
sort(sa.begin(), sa.end(), [&](int a, int b) {
    return rank[a] < rank[b] || 
           (rank[a] == rank[b] && rank[a + w] < rank[b + w]);  // BUG
});
What's wrong?

When a + w >= n or b + w >= n, rank[a + w] is out of bounds. Suffixes that are shorter than w remaining characters should be treated as having rank -1 (or INT_MIN):

cpp
auto second = [&](int x) { return x + w < n ? rank[x + w] : -1; };
sort(sa.begin(), sa.end(), [&](int a, int b) {
    return rank[a] < rank[b] || (rank[a] == rank[b] && second(a) < second(b));
});

Drill 3 -- Distinct substring count: off-by-one with LCP sum

cpp
long long total = (long long)n * (n + 1) / 2;
for (int i = 0; i < n; i++)  // BUG: LCP[0] is usually 0 or undefined
    total -= lcp[i];
What's wrong?

Depending on convention, lcp[0] might be undefined (it's the LCP between the "suffix before the first" and the first -- which doesn't exist). Typically lcp[0] = 0 by definition, so this may or may not be a bug. But if your LCP array is 1-indexed (common in some templates), you should iterate from i = 1. Verify your template's convention.


Self-Test

Before moving to problems, verify you can do these without looking at the notes:

  • [ ] Implement O(nlog2n) SA construction from memory: initialize ranks from characters, iterate doubling rounds, sort by (rank[i], rank[i+k]) pairs, re-rank, early-terminate when all ranks are distinct
  • [ ] Implement Kasai's LCP algorithm from memory, including the inverse-SA computation and the h-- step
  • [ ] State the inverse relationship rank[sa[i]] = i and verify on "banana$" -- explain what bug arises from confusing sa and rank
  • [ ] Compute distinct substrings from SA + LCP: n(n+1)2lcp[i], and explain why lcp[0] is excluded
  • [ ] Describe how to find all occurrences of pattern P in string S via binary search on the suffix array, and state the per-query complexity

Practice Problems

CF RatingWhat You Should Be Able to Do
1200Understand what a suffix array is conceptually; sort suffixes by brute force.
1500Implement O(nlog2n) suffix array construction; compute the LCP array using Kasai's algorithm.
1800Use SA + LCP + RMQ for pattern search, counting distinct substrings, and longest common substring of two strings.
2100Combine SA + LCP with monotonic stack, binary search, or segment tree for advanced problems (e.g., sum of distinct substring lengths, LCP-based grouping).
#ProblemSourceDifficultyKey Idea
1Distinct SubstringsSPOJ1400SA + LCP, count distinct substrings
2Longest Common SubstringSPOJ1500Concatenate two strings, SA + LCP scan
3String MatchingCSES1500Binary search on SA for pattern
4Prefixes and SuffixesCF 432D1900SA + LCP + counting
5Match & CatchCF 427D2000SA + LCP on concatenated strings
6Forbidden IndicesCF 873F2100SA + LCP + sparse table RMQ
7Martian StringsCF 149E2300SA + LCP, substring search
8RepeatsSPOJ2200SA + LCP + binary search on length
9Speckled BandCF 1043G2400SA + LCP, heavy case analysis
10Fake News (hard)CF 802I2300SA + LCP + monotonic stack

Further Reading

Built for the climb from Codeforces 1100 to 2100.