Skip to content

Aho-Corasick

Quick Revisit

  • USE WHEN: Searching for multiple patterns simultaneously in a single text scan
  • INVARIANT: Failure link of node v points to the longest proper suffix of v's string that is a prefix of some pattern in the trie
  • TIME: O(n + m + z) where n = text length, m = total pattern length, z = matches | SPACE: O(m × σ)
  • CLASSIC BUG: Not following the dictionary suffix link chain to report all matching patterns at each position (misses overlapping shorter patterns)
  • PRACTICE: 07-ladder-strings

AL-37 | Multi-pattern string matching in linear time -- build a trie, thread failure links through it, and scan the text once to find all occurrences of all patterns in O(n+m+z).

Difficulty: [Advanced]Prerequisites: KMP and Z-Function, TrieCF Rating Range: 1900 -- 2200+

Contents


Intuition

You have three patterns -- "he", "she", "his" -- and a text "ahishers". The brute-force plan: run KMP (or any single-pattern matcher) once per pattern. Each run costs O(n), so the total is O(kn). For k=3 this is fine. For k=105 patterns against a text of length n=106, you are looking at 1011 operations -- far too slow.

We need a way to search for all patterns in a single pass through the text.

Build a trie of all patterns, then add failure links (like KMP's failure function but across patterns) -- now a single pass through the text finds all pattern occurrences simultaneously.

KMP is a single highway: one pattern, one failure function, one automaton. Aho-Corasick is an interstate system with on-ramps between patterns. Each failure link is an on-ramp that says "the suffix you just read also happens to be a prefix of this other pattern -- jump there and keep driving." By threading all these on-ramps through the shared trie, one scan of the text rides every highway at once.

Visual Walkthrough

Patterns: "he", "she", "his". Text: "ahishers".

Step 1 -- Build the trie.

              (root)
             /      \
            h        s
            |        |
           [1]      [4]
           / \       |
          e   i      h
          |   |      |
         [2] [3]    [5]
        {he}  |      |
              s      e
              |      |
             [*]    [6]
            {his}  {she}

Nodes marked {...} are pattern-end nodes.

  TRIE STRUCTURE after inserting "he", "she", "his":

      root
      ├──'h'──-> [h]
      │          ├──'e'──-> [he]OK
      │          └──'i'──-> [hi]
      │                    └──'s'──-> [his]OK
      └──'s'──-> [s]
               └──'h'──-> [sh]
                         └──'e'──-> [she]OK

  Nodes marked OK = end of a pattern.
  Total trie nodes: 7 (excluding root).

Step 2 -- Compute failure links (BFS from root).

Failure links point to the longest proper suffix of the node's string that exists as a prefix of some node in the trie.

  Node  String  Fail     Notes
  ----  ------  ----     -----
  root  ""      root
  [1]   "h"     root     no proper suffix "h" elsewhere
  [4]   "s"     root
  [2]   "he"    root     "e" is not in the trie as a prefix
  [3]   "hi"    root
  [5]   "sh"    [1]      suffix "h" = node [1]
  [*]   "his"   root     "is","s" not in trie as prefixes
  [6]   "she"   [2]      suffix "he" = node [2] {he}  <-- dictionary suffix link!

The critical link: [6] "she" --fail--> [2] "he". Whenever we match "she", the failure chain also reaches the node for "he", so both matches are reported.

Step 3 -- Scan text "ahishers" through the automaton.

  Pos  Char  State   Action
  ---  ----  -----   ------
   0    a    root    no edge 'a' from root -> stay at root
   1    h    [1]     edge 'h' from root
   2    i    [3]     edge 'i' from [1]="h"
   3    s    [*]     edge 's' from [3]="hi"
                     ** MATCH "his" ending at pos 3 **
   4    h    [1]     fail from [*] has no 'h'; root has 'h' -> [1]
   5    e    [2]     edge 'e' from [1]="h"
                     ** MATCH "he" ending at pos 5 **
   6    r    root    no edge 'r' from [2]; fail to root; no 'r' -> root
   7    s    [4]     edge 's' from root

Step 4 -- Collect results.

Wait -- did we miss "she"? Let us re-examine. In "ahishers", the substring "she" starts at position 4: a h i s h e r s. Positions 4-5-6 are h e r, not s h e. In fact "she" does not appear in "ahishers". The automaton is correct: it found "his" (positions 1..3) and "he" (positions 4..5). Two matches, one pass, O(n+m) work.

Step 5 -- Why failure links matter.

After matching "his" at node [*], the next character is 'h'. Node [*] has no child 'h', so the automaton follows the failure link to root, finds edge 'h', and lands at [1]. Without failure links we would have to restart from the root and rescan -- losing the overlap. The links make the automaton never backtrack in the text.

Overlapping patterns: {he, she, hers} over text "ushers"

The walkthrough above did not actually exercise the dictionary suffix link, because "ahishers" happens not to contain "she". Here is the canonical small example that does. Patterns: he, she, hers. Text: "ushers".

  Trie:                            Failure links (only the interesting ones):
        root
       /    \                       sh   --fail-> h    (suffix "h" of "sh")
      h      s                      she* --fail-> he*  (suffix "he" of "she")
     / \      \                     her  --fail-> root
    e   r      h                    hers*--fail-> s    (suffix "s" of "hers")
    *   |      \
        s       e                   Dictionary suffix links:
        *       *                   she* --dict-> he*  (because he* IS a pattern,
                                                       reachable via the fail chain)
                                    others --dict-> nothing useful
  Stars (*) mark pattern-end nodes.

  Text scan ("ushers"):
    Pos  Char  State          Outputs at this position
    ---  ----  -------------  ---------------------------------------
     0   u     root           --
     1   s     [s]            --
     2   h     [sh]           --
     3   e     [she]*         "she" (state itself is terminal)
                              -> follow dict link to [he]*: also report "he"
     4   r     [her]          --
     5   s     [hers]*        "hers" (terminal)
                              -> follow dict link: nothing further

  Reported matches:  she @ pos 1..3
                     he  @ pos 2..3   <-- found ONLY via the dict suffix link
                     hers@ pos 2..5

The line that matters is at position 3. The automaton state [she] is itself a pattern, so we report "she". But the dictionary suffix link from [she] jumps to [he], which is also a pattern -- so without traversing dict links we would silently miss "he". This is the entire reason dictionary suffix links exist as a separate notion from failure links: failure links find the right state to be in, dictionary suffix links find every pattern that ends at the current text position.

The Invariant

Invariant. At position i in the text, the automaton state represents the longest suffix of text[0..i] that is a prefix of some pattern in the trie. Failure links point from each node to the longest proper suffix of that node's string that is also a node in the trie. Dictionary suffix links further point to the nearest ancestor (via the failure chain) that marks a pattern end, ensuring no match is missed.

What the Code Won't Teach You

The 40-line AC template hides three conceptual layers that you must understand separately:

  1. Why BFS order matters for construction. Failure links are computed in BFS order because fail[v] depends on fail[parent[v]], which must already be computed (it's shallower). If you compute in DFS order, fail[parent[v]] might not exist yet.

  2. The automaton is a DFA -- every state has a transition for every character. The go[v][c] array has no holes. Missing trie edges are filled by following fail links during BFS. This is what makes text scanning O(|text|) -- no fail-chain chasing at query time.

  3. Dictionary suffix links are a separate optimization. They shortcut the fail chain to the nearest complete pattern. Without them, collecting all matches at a position requires walking the entire fail chain (potentially O(depth) per character). With them, you jump directly to relevant matches.

  Layer 1: TRIE              Layer 2: + FAIL LINKS       Layer 3: + OUTPUT LINKS
  ┌─────────────┐            ┌─────────────────┐         ┌──────────────────────┐
  │ Insert all   │            │ BFS to compute   │         │ For each node, cache │
  │ patterns     │     ->      │ fail[v] for all  │    ->    │ nearest pattern via  │
  │ into trie    │            │ nodes in BFS     │         │ fail chain           │
  │              │            │ order; fill all  │         │                      │
  │ O(M) time    │            │ go[v][c] edges   │         │ O(M) time            │
  │              │            │ O(M*Σ) time      │         │                      │
  └─────────────┘            └─────────────────┘         └──────────────────────┘

When to Reach for This

Trigger phrases in problem statements:

  • "multiple pattern matching" / "dictionary of patterns in a text"
  • "banned substrings" / "forbidden words"
  • "how many of the given strings appear in T?"
  • "count occurrences of each pattern"
  • "DP avoiding a set of forbidden patterns" (AC automaton + DP)

Counter-examples -- when not to use AC:

  • Single pattern -- use KMP or Z-function. AC adds unnecessary trie overhead.
  • Patterns are all substrings of each other (e.g. "a", "aa", "aaa") -- AC still works but the dictionary-link walk becomes O(k) per character; use the push-down technique (see Section 6, Pattern 4) or consider suffix structures.
  • You need suffix/substring structure of the text itself -- suffix array or suffix automaton is the right tool.

C++ STL Reference

Function / ClassHeaderWhat it doesTime Complexity
queue<int><queue>BFS queue for computing failure linksO(1) push/pop
array<int, 26><array>Fixed-size child pointer array per trie nodeO(1) access
vector<int><vector>Storing node data, pattern indices, output listsO(1) amortized push
string::operator[]<string>Character access during text scanO(1)
fill(first, last, val)<algorithm>Initialize child arrays to 1O(Σ)
memset<cstring>Fast bulk initialization of node arraysO(Σ)

Note: AC automaton is a hand-built data structure. The STL provides only basic containers -- there is no std::aho_corasick. The table above lists the STL pieces you'll compose into the implementation.


Implementation (Contest-Ready)

Failure link (suffix link): Points to the longest proper suffix of the current string that is also a prefix of some pattern. Used to navigate the automaton on mismatch.

Output link (dictionary suffix link): Points to the longest proper suffix of the current string that is itself a COMPLETE pattern. Used to find ALL patterns that end here.

Why you need BOTH:

  • Failure links let you traverse efficiently (like KMP's failure function).
  • Output links let you enumerate all matches. Without output links, you only find the longest matching pattern at each position, missing shorter ones.

Precompute all transitions: Instead of following failure links at runtime, precompute go[v][c] for every node and character:

cpp
// In BFS order:
if (ch is a child of v) {
    go[v][c] = child;
} else {
    go[v][c] = go[fail[v]][c];  // Inherit from failure link
}

This makes each step O(1) instead of following a chain of failure links.

Version 1: Minimal -- Array-Based AC Automaton

cpp
#include <bits/stdc++.h>
using namespace std;

struct AhoCorasick {
    static constexpr int ALPHA = 26;
    struct Node {
        array<int, ALPHA> ch;
        int fail = 0, end_cnt = 0, dict = -1;
        Node() { ch.fill(-1); }
    };
    vector<Node> t;

    AhoCorasick() : t(1) {}

    void insert(const string& s, int id) {
        int v = 0;
        for (char c : s) {
            int ci = c - 'a';
            if (t[v].ch[ci] == -1) {
                t[v].ch[ci] = (int)t.size();
                t.emplace_back();
            }
            v = t[v].ch[ci];
        }
        t[v].end_cnt++;
        t[v].dict = id;
    }

    void build() {
        queue<int> q;
        for (int c = 0; c < ALPHA; c++) {
            if (t[0].ch[c] == -1)
                t[0].ch[c] = 0;
            else {
                t[t[0].ch[c]].fail = 0;
                q.push(t[0].ch[c]);
            }
        }
        while (!q.empty()) {
            int v = q.front(); q.pop();
            for (int c = 0; c < ALPHA; c++) {
                int u = t[v].ch[c];
                int f = t[t[v].fail].ch[c];
                if (u == -1) {
                    t[v].ch[c] = f;
                } else {
                    t[u].fail = f;
                    // dictionary suffix link: nearest ancestor via fail with end_cnt > 0
                    t[u].dict = (t[f].end_cnt > 0) ? f : t[f].dict;
                    q.push(u);
                }
            }
        }
    }

    // Count total matches in text. Modify for specific needs.
    long long search(const string& text) {
        long long cnt = 0;
        int v = 0;
        for (char c : text) {
            v = t[v].ch[c - 'a'];
            for (int u = v; u > 0; u = t[u].dict) {
                if (t[u].end_cnt > 0)
                    cnt += t[u].end_cnt;
            }
        }
        return cnt;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n;
    cin >> n;
    AhoCorasick ac;
    for (int i = 0; i < n; i++) {
        string s;
        cin >> s;
        ac.insert(s, i);
    }
    ac.build();

    string text;
    cin >> text;
    cout << ac.search(text) << "\n";
}

Version 2: Explained -- With Per-Pattern Counting

cpp
#include <bits/stdc++.h>
using namespace std;

struct AhoCorasick {
    static constexpr int ALPHA = 26;

    struct Node {
        array<int, ALPHA> ch;   // ch[c] = index of child node for character c
        int fail = 0;           // failure (suffix) link
        int dict = -1;          // dictionary suffix link: nearest pattern-end ancestor via fail chain
        int pat_id = -1;        // which pattern ends here (-1 if none)
        Node() { ch.fill(-1); }
    };

    vector<Node> t;
    int num_patterns = 0;

    AhoCorasick() : t(1) {}

    // Insert pattern s with index id into the trie.
    // Time: O(|s|).
    void insert(const string& s, int id) {
        int v = 0;
        for (char c : s) {
            int ci = c - 'a';
            if (t[v].ch[ci] == -1) {
                t[v].ch[ci] = (int)t.size();
                t.emplace_back();
            }
            v = t[v].ch[ci];
        }
        // If multiple patterns are identical, only the last id is stored.
        // For multi-id, use a vector<int> or linked list.
        t[v].pat_id = id;
        num_patterns = max(num_patterns, id + 1);
    }

    // Build failure links and dictionary suffix links via BFS.
    // After build(), t[v].ch[c] is defined for ALL (v, c) pairs --
    // missing edges are filled in to point to the correct fail-chain node,
    // turning the trie into a full automaton (no explicit fail-following needed during search).
    // Time: O(m * ALPHA) where m = total pattern length = number of trie nodes.
    void build() {
        queue<int> q;
        // Initialize root's children: missing edges loop back to root.
        for (int c = 0; c < ALPHA; c++) {
            if (t[0].ch[c] == -1) {
                t[0].ch[c] = 0;  // missing edge -> back to root
            } else {
                t[t[0].ch[c]].fail = 0;
                q.push(t[0].ch[c]);
            }
        }
        while (!q.empty()) {
            int v = q.front(); q.pop();
            for (int c = 0; c < ALPHA; c++) {
                int u = t[v].ch[c];              // actual child
                int f = t[t[v].fail].ch[c];      // where fail link's child goes
                if (u == -1) {
                    // No real child for c -- automaton edge goes to f.
                    t[v].ch[c] = f;
                } else {
                    // Real child exists. Its failure link = f.
                    t[u].fail = f;
                    // Dictionary suffix link: nearest node in the fail chain that ends a pattern.
                    t[u].dict = (t[f].pat_id >= 0) ? f : t[f].dict;
                    q.push(u);
                }
            }
        }
    }

    // Search text and count occurrences of each pattern.
    // Returns a vector cnt where cnt[id] = number of occurrences of pattern id.
    // Time: O(|text| + z), where z = total number of matches.
    vector<int> search(const string& text) {
        vector<int> cnt(num_patterns, 0);
        int v = 0;
        for (char c : text) {
            v = t[v].ch[c - 'a'];  // automaton transition -- always valid after build()
            // Walk dictionary suffix links to collect all patterns ending here.
            for (int u = v; u > 0; u = t[u].dict) {
                if (t[u].pat_id >= 0)
                    cnt[t[u].pat_id]++;
            }
        }
        return cnt;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n;
    cin >> n;
    AhoCorasick ac;
    vector<string> patterns(n);
    for (int i = 0; i < n; i++) {
        cin >> patterns[i];
        ac.insert(patterns[i], i);
    }
    ac.build();

    string text;
    cin >> text;

    auto cnt = ac.search(text);
    for (int i = 0; i < n; i++)
        cout << patterns[i] << ": " << cnt[i] << "\n";
}

ASCII Trace: Aho-Corasick Automaton for

STEP 1 -- Build Trie (insert patterns):
                    (root)
                   /      \
                  h         s
                 / \         \
                e*  i         h
                |    \         \
                r     s*        e*
                |
                s*

  Nodes: 0=root, 1=h, 2=he*, 3=hi, 4=his*, 5=her, 6=hers*, 7=s, 8=sh, 9=she*
  (* = output node -- marks end of a pattern)

STEP 2 -- Build Failure Links (BFS from root):
  Node    Fail    Reason
  ──────────────────────────────────────────────
  h  ->    root    no proper suffix of "h" in trie
  s  ->    root    no proper suffix of "s" in trie
  he ->    root    suffix "e" not in trie
  hi ->    root    suffix "i" not in trie
  sh ->    h       suffix "h" is node 1
  he*r ->  root    suffix "r" not in trie
  hi*s ->  s       suffix "s" is node 7
  sh*e ->  he      suffix "he" is node 2  -> dict link to "he"*
  her*s -> s       suffix "s" is node 7

  Failure links (── ▷) and dictionary links (══▷):
        root ◁── h ◁── he* ◁── her ◁── hers*
               ▲               │
               │               ▼ (fail)
        root ◁── s ◁── sh ──▷ (fail->h)

                        she* ══▷ he* (dict link: also outputs "he")

STEP 3 -- Search text "ahishers":
  Pos  Char  State    Output
  ────────────────────────────────────────
   0    a    root     -
   1    h    h        -
   2    i    hi       -
   3    s    his*     match "his" at pos 1
   4    h    h        -  (fail from his->s->root, then 'h'->h)
   5    e    he*      match "he" at pos 4
   6    r    her      -
   7    s    hers*    match "hers" at pos 4
                      + dict follow -> match "she" at pos 3? No.
                      (via fail: hers->s, no further output)

  Matches found: "his"@1, "he"@4, "hers"@4
  (Note: "she"@1 also matched via she*->he* dict link when processing pos 3
   in a full implementation that checks all suffix outputs.)
  Solid arrows (->) = trie edges     Dashed arrows (⇢) = failure links

             ┌─────── root ───────┐
             │ (fail -> self)      │
             ▼                    ▼
       ┌─── h ──┐               s
       │        │               │
       ▼        ▼               ▼
      he*       hi         sh *****⇢ h  (fail)
       *         │              │
       *         ▼              ▼
       *        his*       she* ****⇢ he* (fail + dict link)
       *
  (he fails -> root)

  On mismatch at any node, follow its failure link and retry.
  Dictionary links chain output-producing failure ancestors.

Operations Reference

Each phase of the AC pipeline has a distinct cost profile -- consult this table to budget time and memory before coding.

OperationTimeSpace
Insert one pattern of length lO(l)O(lΣ) worst case
Insert all k patterns, total length mO(m)O(mΣ)
Build failure links (BFS)O(mΣ)O(m)
Search text of length n (total match count)O(n+z)O(1) extra
Search text (per-pattern count)O(n+z)O(k)
Full pipeline: insert + build + searchO(mΣ+n+z)O(mΣ)

Σ = alphabet size (26 for lowercase). z = number of matches. m = sum of pattern lengths. n = text length. k = number of patterns.

Memory note: With Σ=26 and int children, each node uses 26×4=104 bytes. For m=106 nodes, that is 100 MB -- tight. Consider using a flat 2D array int ch[MAXNODES][26] to avoid vector overhead, or reduce to per-problem alphabet size.


Problem Patterns & Tricks

Pattern 1: Multi-Pattern Search (Direct)

Given a text and k patterns, determine which patterns appear and/or count occurrences. Direct application of AC.

Approach: Insert all patterns, build, scan text, collect matches via dictionary suffix links.

Examples: CF 1400F -- x-Prime Substrings, CF 710F -- String Set Queries


Pattern 2: DP on AC Automaton

Build the AC automaton as states. Run DP where dp[i][v] = "best value after processing i characters while at automaton state v." Transitions follow automaton edges.

Typical problem: "Count strings of length n over alphabet {a..z} that contain none (or at least one) of the forbidden patterns."

  AC AUTOMATON AS DP STATE SPACE:

  Problem: "Count strings of length N over {a,b} avoiding patterns {ab, ba}"
  AC automaton states: {root, a, b, [ab]X, [ba]X}

  DP[i][state] = # strings of length i ending in 'state'

  Transitions (only non-forbidden states):
  ┌───────┐  'a'   ┌───┐  'b'   ┌────┐
  │ root  │──────-> │ a │──────-> │ ab │X (forbidden)
  │       │  'b'   ├───┤  'a'   ├────┤
  │       │──────-> │ b │──────-> │ ba │X (forbidden)
  └───────┘        └───┘        └────┘

  Only walk transitions where target is NOT a forbidden state.
  Answer = Σ DP[N][non-forbidden states]
cpp
// dp[i][v] = number of strings of length i ending at automaton state v
// Transition: dp[i+1][t[v].ch[c]] += dp[i][v]  for each c in [0, ALPHA)
// Skip states that lie on a pattern-end path (if counting avoidance).

Examples: CF 1207G -- Indie Album, CF 86C -- Genetic Engineering


Pattern 3: AC Automaton + Bitmask DP

When k is small (k20), track which patterns have been matched using a bitmask. dp[v][mask] = "at automaton state v, having matched the patterns indicated by mask."

Goal: Find the shortest string that contains all k patterns as substrings (shortest common superstring variant).

Examples: CF 533F -- Encoding


Pattern 4: Counting Matches Efficiently (Push-Down Technique)

When z (total matches) is huge, walking dictionary suffix links at every text position becomes O(nk) in the worst case. Instead, during the scan only increment a counter at the current node. After the scan, push counts from children to parents via the fail tree (reverse BFS order) in O(m).

cpp
// During search: only do cnt[v]++ (don't walk dict links)
// After search: build BFS order, then push counts in reverse
vector<int> order;
order.reserve(t.size());
queue<int> bfs;
for (int c = 0; c < 26; c++)
    if (t[0].ch[c] != 0) bfs.push(t[0].ch[c]);
while (!bfs.empty()) {
    int v = bfs.front(); bfs.pop();
    order.push_back(v);
    for (int c = 0; c < 26; c++)
        if (t[v].ch[c] != t[t[v].fail].ch[c])
            bfs.push(t[v].ch[c]);
}
// Process nodes in reverse BFS order (deepest first)
for (int i = (int)order.size() - 1; i >= 0; i--)
    cnt[t[order[i]].fail] += cnt[order[i]];
// Now cnt[v] = total times the automaton visited v or any descendant in fail tree

This reduces worst-case search from O(nk) to O(n+m).

Examples: CF 710F -- String Set Queries, CF 963D -- Frequency of String


Pattern 5: Online / Dynamic Pattern Set

Patterns are added or removed between queries. Rebuild the AC automaton after each batch of updates. For q queries, batch and rebuild in O((m+n)q) amortized, or rebuild fully each time if the pattern set is small enough.

Examples: CF 710F -- String Set Queries


Pattern 6: AC on Tree (Euler Tour + Offline)

Given strings on tree edges/nodes, answer "how many times does pattern pi appear in the root-to-node-v path string?" Map each pattern to an AC automaton node. During DFS on the tree, maintain the current automaton state. Use the fail tree + Euler tour + BIT to answer subtree queries in O((n+m)logm).

Examples: CF 1207G -- Indie Album, CF 547E -- Mike and Friends


Contest Cheat Sheet

+------------------------------------------------------------------+
|                   AHO-CORASICK CHEAT SHEET                       |
+------------------------------------------------------------------+
| WHEN TO USE:                                                     |
|   - Multiple pattern search in a single text                     |
|   - "Which of these k patterns appear?" / count occurrences      |
|   - DP over string states avoiding/requiring patterns            |
|   - k * |text| is too slow (need linear scan)                    |
+------------------------------------------------------------------+
| BUILD:                                                           |
|   1. Insert all patterns into trie         O(m)                  |
|   2. BFS to compute fail + dict links      O(m * ALPHA)          |
|   3. Scan text, follow automaton edges      O(n + z)             |
+------------------------------------------------------------------+
| KEY FORMULAS:                                                    |
|   Nodes <= m + 1    (m = sum of pattern lengths)                 |
|   Memory ~ m * ALPHA * 4 bytes                                   |
|   Time:  O(m * ALPHA + n + z)                                    |
+------------------------------------------------------------------+
| FAIL LINK:  longest proper suffix of current path in trie        |
| DICT LINK:  nearest pattern-end reachable via fail chain         |
| AUTOMATON:  after build, ch[v][c] always valid (no -1)           |
+------------------------------------------------------------------+
| OPTIMIZE:  push-down on fail tree to avoid O(n*k) dict walks     |
+------------------------------------------------------------------+

Gotchas & Debugging

Failure links alone find only the longest pattern ending at each position. If "she" and "he" are both patterns and you match "she", the failure link from the "she" node goes to the "he" node. Without dictionary suffix links (or walking the full fail chain), you miss "he". Always walk dict links or use the push-down technique.

Overlapping identical patterns

If pattern "abc" is inserted twice, you need end_cnt (a counter) at the node, not just a boolean flag. Otherwise you undercount.

Missing automaton completion

After build(), every t[v].ch[c] must be defined (no 1 remaining). The BFS fills missing edges to point along the fail chain. If you skip this, following an edge during search can crash or silently go to state 1.

Alphabet mismatch

If patterns use characters outside [a-z], the c - 'a' mapping produces negative or out-of-range indices. Either remap characters or use unordered_map<char,int> per node (much slower).

Memory limit exceeded

With m=5×105 and Σ=26, naive vector<Node> with array<int,26> per node uses 52 MB. On tight memory limits (256 MB), use a flat 2D array:

cpp
int ch[MAXNODES][26];
int fail[MAXNODES], dict[MAXNODES], pat_id[MAXNODES];

This eliminates per-vector-element overhead and is cache-friendlier.

  MEMORY LAYOUT COMPARISON:

  Vector of structs (high overhead):     Flat arrays (cache-friendly):
  ┌────────────────────────┐             ┌────────────────────────────┐
  │ struct Node {          │             │ int ch[MAXN][26];   // 26  │
  │   array<int,26> ch;    │  xMAXN     │ int fail[MAXN];            │
  │   int fail, dict;      │  nodes     │ int dict[MAXN];            │
  │   int pat_id;          │             │ int pat_id[MAXN];          │
  │ };                     │             │                            │
  │ sizeof ~= 120 bytes     │             │ sizeof per node ~= 116 B    │
  │ + vector overhead      │             │ but contiguous -> fast      │
  └────────────────────────┘             └────────────────────────────┘
  M=5x10⁵ -> ~72 MB + overhead           M=5x10⁵ -> ~58 MB, no overhead

Text "aaa...a" with patterns "a", "aa", "aaa", ..., "a^k" causes O(k) dict link walks per text character. Total: O(nk). Use the push-down technique (Pattern 4) to reduce to O(n+m).

Debug checklist

  1. Print trie: for each node, print its string, fail target, dict target, pat_id.
  2. Verify build() leaves no ch[v][c] == -1.
  3. Test with a single pattern (should degrade to KMP behavior).
  4. Test with identical patterns.
  5. Test with empty text and empty pattern set.

Mental Traps

Trap 1: "I built the trie, so I'm done." Building the trie is step 1 of 3. Without failure links (step 2) and automaton completion (step 3), you have a structure that can only match patterns starting at position 0.

  INCOMPLETE SETUP -- what many beginners build:

  Trie only (no fail links):          With fail + automaton:
  root──h──e(OK)                       root──h──e(OK)──r──s(OK)
      │                                   │  ↑fail    ↑fail
      └──s──h──e(OK)                       └──s──h──e(OK)
                                              ↑fail│  ↑fail=h->e
                                              root │

                                          go[s-node]['h'] -> h-node
                                          (automaton fills ALL edges)

Trap 2: "Output links and failure links are the same thing." Failure link = where to go on mismatch (longest suffix that's a trie prefix). Output link = nearest ancestor via fail chain that is a complete pattern. Confusing them means you find mismatches but miss overlapping matches.

  Patterns: {"he", "she", "hers"}

  At node for "she":
    fail["she"]   -> "e" node  (longest suffix in trie)
    output["she"] -> "he" node (nearest COMPLETE pattern via fail chain)

  If you only follow fail links for matches:
    X You land on "e" -- NOT a complete pattern -- and stop.
    X You miss that "he" also matched.

  If you follow output links:
    OK output["she"] = "he" node -> report "he" match
    OK output["he"] = null -> done

Trap 3: "AC is overkill -- I'll just run K separate KMP searches." For K=3, sure. For K=10⁵ patterns with total length M=10⁶ against text of length N=10⁶, KxKMP = O(KN) = 10¹¹. AC = O(N+M) = 210⁶. Know the crossover point.


Common Mistakes

  1. Missing dictionary suffix links during output. "My AC automaton found some patterns but missed others that definitely appear in the text..." You computed failure links correctly but forgot to follow dictionary suffix links during the output phase. At a node v, checking if v itself is terminal is not enough -- you must also walk the fail chain to find shorter patterns that also end at this position. Fix: maintain a dict_suffix_link[v] (the nearest terminal ancestor in the fail chain) or, after scanning, propagate counts from children to parents in the fail tree.

Debug Drills

Drill 1 -- BFS for failure links skips the root's children

cpp
queue<int> q;
q.push(0);  // root
while (!q.empty()) {
    int v = q.front(); q.pop();
    for (auto& [c, u] : trie[v].children) {
        trie[u].fail = (v == 0) ? 0 : trie[trie[v].fail].children[c];
        q.push(u);
    }
}
What's wrong?

When computing trie[u].fail for a non-root parent, trie[trie[v].fail].children[c] might not exist. You need to follow the fail chain until you find a node that has child c, or reach root. The correct approach: precompute the go[v][c] function (automaton transition) that handles fail-link following, then fail[u] = go[fail[v]][c].

Drill 2 -- Not marking output via fail chain

cpp
// Scanning text
int v = 0;
for (char c : text) {
    v = go[v][c];
    if (trie[v].is_terminal)
        results.push_back(trie[v].pattern_id);
    // Missing: check fail chain for shorter pattern matches
}
What's wrong?

At node v, there might be shorter patterns ending at the same position -- reachable via the fail chain. Add:

cpp
for (int u = v; u != 0; u = trie[u].dict_suffix_link) {
    if (trie[u].is_terminal)
        results.push_back(trie[u].pattern_id);
}

Or use the dict_suffix_link to jump directly to the nearest terminal ancestor.

Drill 3 -- Automaton transition function missing fail fallback

cpp
int go(int v, int c) {
    if (trie[v].children.count(c))
        return trie[v].children[c];
    return 0;  // BUG: should follow fail link, not jump to root
}
What's wrong?

When node v has no child for character c, you should follow fail[v] and try again (recursively), not jump directly to root. The standard approach precomputes go[v][c] during BFS:

cpp
if (trie[v].children.count(c))
    go[v][c] = trie[v].children[c];
else
    go[v][c] = (v == 0) ? 0 : go[trie[v].fail][c];

Self-Test

Before moving to the practice problems, you should be able to answer these without looking anything up:

  • [ ] Given patterns {"ab", "b", "bc"}, draw the trie, compute all failure links via BFS, and fill in automaton transitions for every (state, character) pair.
  • [ ] Trace the AC automaton on text "xabcy" with the patterns above -- list every state visited and every pattern match reported (including via output links).
  • [ ] Explain why the root's failure link must point to itself, and what breaks if it doesn't.
  • [ ] State the time complexity of AC construction and text scanning separately, in terms of total pattern length M, text length N, alphabet size Σ, and number of matches Z.
  • [ ] Describe the push-down technique for counting all pattern occurrences in O(N + M) without walking output links during scanning.
  • [ ] Give an example where running K separate KMP instances is faster than AC (small K), and one where AC dominates (large K).

Practice Problems

CF RatingWhat You Should Be Able to Do
1200Search for a single pattern using KMP or find(); understand what a trie is.
1500Build a trie of patterns and add failure (suffix) links via BFS; scan a text to find all pattern occurrences.
1800Use dictionary suffix links for efficient output; combine AC with DP on the automaton states.
2100DP on the AC automaton with bitmask or counting states; combine with fail-tree properties (subtree queries, Euler tour).
#ProblemSourceDifficultyKey Idea
1Aho-Corasick (EDU)CF EDU1600Step-by-step introduction, build and query
2String Set QueriesCF 710F1900Dynamic pattern set, rebuild AC per batch
3Frequency of StringCF 963D2000Per-pattern occurrence positions, push-down
4Genetic EngineeringCF 86C2100DP on AC automaton, coverage constraints
5Indie AlbumCF 1207G2200AC on tree + Euler tour + BIT
6Mike and FriendsCF 547E2200Suffix automaton / AC + fail tree queries
7x-Prime SubstringsCF 1400F2300AC + DP, avoid forbidden substrings
8SleepingCF 519E2100AC automaton state DP with modular counting
9Finding PatternsCSES1800Multi-pattern existence check
10Counting PatternsCSES1900Count occurrences of each pattern
11Pattern PositionsCSES1900Find first occurrence position of each pattern

Further Reading

Built for the climb from Codeforces 1100 to 2100.