Skip to content

String Hashing

AL-34 | Map strings to integers via polynomial hashing -- compare substrings in O(1) after O(n) preprocessing. A pragmatic comparator that wins most contest string problems, as long as you remember it is a fingerprint, not a proof.

Difficulty: [Intermediate]Prerequisites: Arrays and Strings, Math FundamentalsCF Rating Range: 1400 -- 1800

Contest Frequency: ** Regular -- common in string comparison problems


Quick Revisit

  • USE WHEN: fast substring comparison, counting distinct substrings, or string matching
  • INVARIANT: hash(s[l..r]) computed in O(1) from prefix hashes and precomputed powers
  • TIME: O(n) preprocess, O(1) per query | SPACE: O(n)
  • CLASSIC BUG: single hash causes collisions -- always use double hashing (two mod values)
  • PRACTICE: Strings Practice Ladder

Contents

Intuition

Suppose you want to check whether a pattern p of length m=5 appears inside a text s of length n=20.

The naive approach: slide p over every position i in s and compare character by character.

  s: [a b c a b c d a b c a b d a b c a b c a]    n = 20
  p: [a b c a b]                                    m = 5

  Position 0:  compare s[0..4] vs p[0..4]  ->  5 comparisons
  Position 1:  compare s[1..5] vs p[0..4]  ->  up to 5 comparisons
  ...
  Position 15: compare s[15..19] vs p[0..4] ->  up to 5 comparisons
  ---------------------------------------------------------------
  Total positions: n - m + 1 = 16
  Worst-case comparisons: 16 * 5 = 80 = O(n * m)

For n=106 and m=103, that is 109 character comparisons -- too slow. We need a way to compare entire substrings in O(1).

Convert each substring to a single number (its hash) so that comparing two substrings takes O(1) instead of O(m) -- and use a polynomial rolling hash to compute all nm+1 substring hashes in O(n) total.

Analogy: Imagine you have a phone book with millions of names. Comparing two names letter-by-letter is slow. But if every name had a unique phone number, you could compare two numbers in a single step. A polynomial hash is that phone number -- it compresses an arbitrary-length string into a fixed-size integer. Two different strings might share a number (collision), but with a good hash this is astronomically unlikely.

Visual walkthrough

We hash the string s = "abcab" using base b=31 and character values a=1,b=2,c=3. Think of each string as a number in base b.

Step 1 -- Build prefix hashes left to right.

Each prefix hash absorbs one more character: h[i+1]=h[i]b+s[i].

  i:       0       1       2       3       4
  s[i]:    a(1)    b(2)    c(3)    a(1)    b(2)

  h[0] = 0
  h[1] = h[0]*31 + 1  =  0 + 1       =  1
  h[2] = h[1]*31 + 2  =  31 + 2      =  33
  h[3] = h[2]*31 + 3  =  1023 + 3    =  1026
  h[4] = h[3]*31 + 1  =  31806 + 1   =  31807
  h[5] = h[4]*31 + 2  =  986017 + 2  =  986019

Step 2 -- Extract hash of a substring in O(1) via the prefix trick.

Want hash of s[2..4] = "cab". The prefix hash h[5] encodes "abcab" and h[2] encodes "ab". Subtract the contribution of "ab" shifted into the high digits:

  hash(s[2..4]) = h[5] - h[2] * b^3
                = 986019 - 33 * 29791
                = 986019 - 983103
                = 2916       (all mod some prime p)

Verify directly: 3312+131+2=2883+31+2=2916. Correct.

Step 3 -- Slide the window: rolling from s[0..4] to s[1..5].

  hash(s[0..4]) = h[5] - h[0] * b^5 = 986019 - 0 = 986019
  hash(s[1..5]) = h[6] - h[1] * b^5

Each new window hash is one subtraction and one lookup -- O(1) per position.

Step 4 -- Compare pattern hash to each window hash.

  pattern "abcab" hash = 986019
  Window i=0: hash(s[0..4]) = 986019  -> MATCH (verify chars to confirm)
  Window i=1: hash(s[1..5]) = ...     -> compare in O(1)
  Window i=2: hash(s[2..6]) = ...     -> compare in O(1)
  ...
  Total: O(n) hash comparisons instead of O(n * m) char comparisons

The invariant

hash(s[l..r]) = (h[r+1]h[l]brl+1) mod p

Every substring of s maps to a number in [0,p). Two equal substrings always produce the same hash. Two different substrings may collide (same hash), but with a random base and large prime p, the collision probability per pair is 1/p.

Use double hashing (two independent (b,p) pairs) to push collision probability to 1/(p1p2)1018 -- effectively zero for contest constraints (n106).

  Birthday paradox: collision probability vs. number of hashes

  Single hash (mod ~= 10⁹):
  
  Hashes:    100    1,000   10,000   30,000   100,000
  P(col):   ~0%    ~0.05%   ~5%     ~50%     ~99.9%
             ─       ─       ░░      ████     ████████
  
  Double hash (mod ~= 10¹⁸):
  
  Hashes:    100    1,000   10⁶      10⁹
  P(col):   ~0%    ~0%     ~0%      ~0.0000005%
             ─       ─       ─        ─
  
  With n = 10⁵ substrings and single hash:
    Pairs ~= 5 x 10⁹  ->  EXPECT ~5 collisions  ->  WA
  
  With double hash:
    Pairs ~= 5 x 10⁹  ->  EXPECT ~5 x 10⁻⁹ collisions  ->  SAFE

Fingerprints, not proofs

It is worth being precise about what a hash equality means.

  • Mathematical equality of two substrings is a property of the strings themselves: every character at every offset matches. KMP, Z, suffix arrays, and direct character comparison all establish this.
  • Hash equality says only that two strings produce the same fingerprint under your chosen (base, mod). Equal strings always have equal hashes; the reverse is "almost always" true under double hashing with random parameters.

Two practical consequences:

  1. For most contest problems, double hashing is reliable enough that treating "hash equal" as "string equal" is fine. The expected number of collisions over a contest's worth of comparisons is effectively zero.

  2. When the cost of one wrong answer is total failure -- e.g., you build a graph of "equal" substrings and run DFS, or you commit to an output based on a single equality check -- consider verifying surviving candidates by direct character comparison after the hash filter:

    cpp
    if (hs.query(l1, r1) == hs.query(l2, r2)
        && equal(s.begin() + l1, s.begin() + r1 + 1, s.begin() + l2)) {
        // confirmed equal
    }

    The verification is O(len) per surviving pair, but you only pay it on hash matches, so total cost stays close to the hash-only version.

This is the core editorial takeaway: hashing is randomized contest reliability, not mathematical equality. Use it freely, but design your algorithm so a single false positive cannot silently corrupt the rest of the work.

What the Code Won't Teach You

When exact algorithms are strictly better. Hashing is probabilistic -- a collision gives a wrong answer. KMP and suffix arrays are deterministic. For problems where you string together multiple equality checks and act on them irreversibly (e.g., building a graph of equal substrings, then running DFS), a single hash collision cascades into completely wrong output. Exact algorithms don't have this failure mode. Know when the risk is acceptable.

The birthday paradox is your enemy. With a single mod 109, you need only 10930,000 distinct hash values before a 50% chance of collision. If you're hashing all O(n2) substrings for n=3000, that's 4.5×106 hashes -- collisions are virtually guaranteed. Double hashing (1018 collision resistance) is not overkill; it's the minimum for safety.

Hashing on unordered_map is double jeopardy. If you store hash values in an unordered_map, the map itself uses another hash function on your keys. A collision in the map's internal hash slows lookups to O(n); a collision in your string hash gives wrong answers. Both can be exploited by adversarial inputs on Codeforces. Use map (slower but safe) or a custom splitmix64 hash for the container.

Defending against anti-hash tests. Codeforces allows "hacking" with adversarial inputs, and there are well-known techniques for constructing strings that collide under common hash parameters. Your defenses: (1) randomize the base and mod at runtime so attackers can't precompute collisions, (2) use double hashing with two independent (mod, base) pairs, (3) avoid the popular mods like 109+7 or 109+9 -- use large random primes instead. A hashing solution that works on pretests but dies to a hack is worse than a slower deterministic one that passes everything.

The mental model: hashing is fingerprinting, not identification. Think of a hash as a fingerprint -- if two fingerprints differ, the strings are definitely different; if they match, the strings are probably the same. This one-sided guarantee means hashing excels at disproving equality (filter out non-matches fast) but can't prove it. Design your algorithms around this asymmetry. Use hashing to narrow candidates, then verify the survivors with exact comparisons if correctness demands it.

Where hashing truly shines over heavier machinery. Suffix arrays and automata are powerful but rigid -- they process one fixed string. Hashing is the most flexible string tool you have: compare substrings of different strings, hash 2D grids or trees, combine with binary search to find longest common prefixes, or roll hashes across a sliding window. Anytime a problem involves comparing pieces from multiple sources or the structure doesn't fit neatly into suffix-array territory, hashing is often the simplest correct approach.

When to reach for this

Trigger phrases in problem statements:

  • "compare substrings for equality"
  • "pattern matching" / "find all occurrences"
  • "count distinct substrings"
  • "check if a substring is a palindrome"
  • "longest common substring" (combine with binary search)
  • "string period" / "string has a repeating pattern"

Counter-examples -- when hashing is NOT the best tool:

  • You need a guaranteed exact match with no false positives -> use KMP or Z-function (deterministic).
  • Single-pattern search where simplicity matters -> KMP may be simpler and equally fast.
  • You need the full suffix array or LCP array -> build a suffix array directly.
  • The problem involves edits/updates to the string -> hashing requires O(n) rebuild (consider segment tree + hashing for point updates).

C++ STL Reference

String hashing is not provided by the STL -- you implement it yourself. These STL tools help:

Function / ClassHeaderWhat it doesTime Complexity
string::operator[]<string>Access character at index (use for hash value)O(1)
string::substr(pos, len)<string>Extract substring (avoid in hot loops -- O(n))O(n)
string::size()<string>Length of stringO(1)
mt19937 rng(chrono::steady_clock::now().time_since_epoch().count())<random>, <chrono>Randomized base to defeat anti-hash testsO(1)
uniform_int_distribution<int><random>Pick random base in range [lo,hi]O(1)
__int128built-in (GCC)Avoid overflow in modular multiplicationO(1)

Key detail: Don't use std::hash<string> for CP -- it's not portable, not invertible over substrings, and trivially hackable on Codeforces.


Implementation (Contest-Ready)

Version 1: Single Hash -- Minimal Template

cpp
#include <bits/stdc++.h>
using namespace std;

struct SingleHash {
    static const long long MOD = 1e9 + 7;
    static const long long BASE = 131;  // pick randomly in contest
    int n;
    vector<long long> h, pw;

    SingleHash(const string& s) : n(s.size()), h(n + 1), pw(n + 1) {
        pw[0] = 1;
        for (int i = 0; i < n; i++) {
            h[i + 1] = (h[i] * BASE + s[i]) % MOD;
            pw[i + 1] = pw[i] * BASE % MOD;
        }
    }

    // hash of s[l..r] (0-indexed, inclusive)
    long long query(int l, int r) {
        return (h[r + 1] - h[l] * pw[r - l + 1] % MOD + MOD) % MOD;
    }
};

int main() {
    string s;
    cin >> s;
    SingleHash sh(s);

    int q;
    cin >> q;
    while (q--) {
        int l1, r1, l2, r2;
        cin >> l1 >> r1 >> l2 >> r2;
        cout << (sh.query(l1, r1) == sh.query(l2, r2) ? "Yes" : "No") << "\n";
    }
}

Version 2: Double Hash with Substring Query -- Explained

cpp
#include <bits/stdc++.h>
using namespace std;

struct DoubleHash {
    // Two large primes far from powers of 2.
    // Randomize BASE1, BASE2 in contest to resist hacking.
    static constexpr long long MOD1 = 1e9 + 7, MOD2 = 1e9 + 9;
    long long BASE1, BASE2;
    int n;
    vector<long long> h1, h2, pw1, pw2;

    DoubleHash(const string& s, long long b1 = 131, long long b2 = 137)
        : BASE1(b1), BASE2(b2), n(s.size()),
          h1(n + 1), h2(n + 1), pw1(n + 1), pw2(n + 1) {
        pw1[0] = pw2[0] = 1;
        for (int i = 0; i < n; i++) {
            // Build prefix hashes: h[i+1] = h[i] * base + s[i]
            h1[i + 1] = (h1[i] * BASE1 + s[i]) % MOD1;
            h2[i + 1] = (h2[i] * BASE2 + s[i]) % MOD2;
            // Precompute powers of base
            pw1[i + 1] = pw1[i] * BASE1 % MOD1;
            pw2[i + 1] = pw2[i] * BASE2 % MOD2;
        }
    }

    // Returns (hash1, hash2) for substring s[l..r] (0-indexed, inclusive).
    // Formula: h[r+1] - h[l] * base^(r-l+1), taken mod.
    // The +MOD before final %MOD handles C++ negative remainder.
    pair<long long, long long> query(int l, int r) {
        int len = r - l + 1;
        long long v1 = (h1[r + 1] - h1[l] * pw1[len] % MOD1 + MOD1) % MOD1;
        long long v2 = (h2[r + 1] - h2[l] * pw2[len] % MOD2 + MOD2) % MOD2;
        return {v1, v2};
    }

    // Compare two substrings for equality in O(1).
    bool equal(int l1, int r1, int l2, int r2) {
        return query(l1, r1) == query(l2, r2);
    }
};

// Randomize bases at startup to defeat anti-hash tests.
mt19937 rng(chrono::steady_clock::now().time_since_epoch().count());
long long rand_base(long long lo = 256, long long hi = 1e9) {
    return uniform_int_distribution<long long>(lo, hi)(rng);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    string s, t;
    cin >> s >> t;

    long long b1 = rand_base(), b2 = rand_base();
    DoubleHash hs(s, b1, b2), ht(t, b1, b2);

    // Example: find all occurrences of t in s (Rabin-Karp)
    int n = s.size(), m = t.size();
    auto target = ht.query(0, m - 1);
    vector<int> matches;
    for (int i = 0; i + m - 1 < n; i++) {
        if (hs.query(i, i + m - 1) == target)
            matches.push_back(i);
    }

    cout << matches.size() << "\n";
    for (int pos : matches) cout << pos + 1 << " ";
    cout << "\n";
}

Operations Reference

All operations assume the prefix hash array is already built. The O(1) substring query is what makes hashing competitive with dedicated string algorithms.

OperationTimeSpace
Build prefix hashes (length n)O(n)O(n)
Query hash of substring s[l..r]O(1)--
Compare two substringsO(1)--
Rabin-Karp: find pattern m in text nO(n+m)O(n+m)
Hash + binary search (e.g., longest common prefix)O(nlogn)O(n)
Hash all substrings of length kO(n)O(n)
Build hash for two stringsO(n+m)O(n+m)
Rebuild hash after single-char updateO(n)--

Problem Patterns & Tricks

Pattern 1: Rabin-Karp String Matching

Find all occurrences of pattern t in text s. Hash t, then slide a window of length |t| over s comparing hashes.

cpp
auto target = ht.query(0, m - 1);
for (int i = 0; i + m <= n; i++)
    if (hs.query(i, i + m - 1) == target)
        // match at position i

Problems: CF 271D, CF 126B

Pattern 2: Longest Common Substring via Binary Search + Hashing

Binary search on length k. For each k, hash all substrings of length k from both strings, check for common hash in a set.

cpp
auto check = [&](int k) -> bool {
    unordered_set<long long> seen;
    for (int i = 0; i + k <= n; i++)
        seen.insert(combine(hs.query(i, i + k - 1)));
    for (int i = 0; i + k <= m; i++)
        if (seen.count(combine(ht.query(i, i + k - 1))))
            return true;
    return false;
};
// binary search on k in [0, min(n, m)]

Total: O((n+m)logmin(n,m)).

Problems: CF 580E

Pattern 3: Palindrome Check

A substring s[l..r] is a palindrome iff its forward hash equals its reverse hash. Build hashes for both s and reverse(s).

cpp
// forward hash of s[l..r]
auto fwd = hs.query(l, r);
// reverse hash: position in reversed string is (n-1-r)..(n-1-l)
auto rev = hr.query(n - 1 - r, n - 1 - l);
bool is_palindrome = (fwd == rev);

Problems: CF 7D, CF 1326D2

Pattern 4: Count Distinct Substrings

Hash every substring of every length, insert into a set. The set size is the answer.

cpp
set<pair<long long,long long>> distinct;
for (int len = 1; len <= n; len++)
    for (int i = 0; i + len <= n; i++)
        distinct.insert(hs.query(i, i + len - 1));
cout << distinct.size();

O(n2logn) -- good enough for n3000. For larger n, use suffix array.

Problems: CF 271D, SPOJ DISUBSTR

Binary search on the length of the common prefix of s[i..] and s[j..]. Check equality via hash.

cpp
int lcp(int i, int j) {
    int lo = 0, hi = min(n - i, n - j);
    while (lo < hi) {
        int mid = lo + (hi - lo + 1) / 2;
        if (hs.query(i, i + mid - 1) == hs.query(j, j + mid - 1))
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

O(logn) per query. Useful as a building block for string comparison, suffix array construction, etc.

Problems: CF 126B

  Binary search on LCP length using hash comparison:

  s = "a b c a b d a b c a b e"
       0 1 2 3 4 5 6 7 8 9 ...

  LCP of suffix at i=0 vs suffix at i=6:

  lo=0                          hi=6
  ├─────────────────────────────┤
          mid=3
          hash(s[0..2]) vs hash(s[6..8])
          "abc" == "abc"  ->  match! lo=3
                    mid=4
                    hash(s[0..3]) vs hash(s[6..9])
                    "abca" == "abca"  ->  match! lo=4
                              mid=5
                              hash(s[0..4]) vs hash(s[6..10])
                              "abcab" == "abcab"  ->  match! lo=5
                                        mid=5 (converged)
                                        -> LCP = 5

  Each step: O(1) hash comparison.  Total: O(log n).

Pattern 6: Hashing + DP (String Periods, Repetitions)

Use hashing to check if a string has period p: s[i]=s[i+p] for all valid i, equivalent to checking h(s[0..np1])=h(s[p..n1]).

cpp
for (int p = 1; p <= n; p++)
    if (n % p == 0 && hs.query(0, n - p - 1) == hs.query(p, n - 1))
        // s has period p

Problems: CF 182D, CF 25E


Contest Cheat Sheet

+-------------------------------------------------------+
|              STRING HASHING CHEAT SHEET                |
+-------------------------------------------------------+
| WHEN TO USE                                           |
|   - Substring equality queries                        |
|   - Pattern matching (Rabin-Karp)                     |
|   - "Count distinct substrings" up to n ~ 3000        |
|   - Palindrome checking with reverse hash             |
|   - Binary search on string properties (LCP, LCS)    |
+-------------------------------------------------------+
| BUILD                                                 |
|   pw[0]=1; for i in 0..n-1:                           |
|     h[i+1] = (h[i]*B + s[i]) % M                     |
|     pw[i+1] = pw[i]*B % M                             |
+-------------------------------------------------------+
| QUERY s[l..r] (0-indexed, inclusive)                  |
|   (h[r+1] - h[l]*pw[r-l+1] % M + M) % M             |
+-------------------------------------------------------+
| SAFE PARAMS                                           |
|   Double hash: MOD = 1e9+7, 1e9+9                    |
|   BASE: random in [256, 1e9]                          |
|   Randomize at runtime to beat hacks                  |
+-------------------------------------------------------+
| COMPLEXITY   Build: O(n)   Query: O(1)               |
+-------------------------------------------------------+

Gotchas & Debugging

⚠️ Single hashing is NOT safe. With one hash and mod ~10^9, birthday paradox gives ~50% collision probability at ~30,000 strings. On Codeforces, adversarial tests WILL break single hashing.

Always use double hashing (two independent bases and mods):

cpp
const long long MOD1 = 1e9 + 7, MOD2 = 1e9 + 9;
const long long BASE1 = 131, BASE2 = 137;
// Store pair {hash1, hash2} -- compare both

Collision probability with double hashing: ~1/(MOD1 * MOD2) ~= 10^{-18} per pair.

Also: When computing hash differences for substrings, add MOD before taking %: (h[r] - h[l-1] * pw[r-l+1] % MOD + MOD) % MOD -- the subtraction can go negative.

1. Negative modular arithmetic. C++ % can return negative values when the dividend is negative. The expression h[r+1] - h[l] * pw[len] % MOD can go negative. Always add +MOD before the final %MOD:

cpp
(h[r+1] - h[l] * pw[len] % MOD + MOD) % MOD
//                                ^^^^

2. Operator precedence trap.h[l] * pw[len] % MOD is parsed as (h[l] * pw[len]) % MOD because * and % have the same precedence and associate left-to-right. This is correct here. But if you write h[r+1] - h[l] * pw[len] + MOD) % MOD without the inner % MOD, the multiplication overflows long long. Always reduce after each multiply.

3. Collision probability -- don't use single hash in contests with hacking. A single mod 109 gives 109 collision probability per pair. With n105 substrings, you make 1010 pair comparisons -- expect collisions. Use double hashing or pick m>1018 (via __int128).

4. Choosing bases.

  • Don't use BASE = 26 or BASE = 31 -- they're the first thing anti-hash tests target.
  • Randomize at runtime: BASE = uniform_int_distribution<long long>(256, 1e9)(rng).
  • Make sure BASE > 256 so all ASCII values are distinct modulo the base.

5. Off-by-one in prefix hash indexing. The prefix hash array is 1-indexed: h[0] = 0, h[i] = hash of s[0..i-1]. The substring s[l..r] uses h[r+1] and h[l]. Mixing 0-indexed and 1-indexed is the #1 source of bugs. Write a query(l, r) function and always go through it.

6. Comparing substrings from different strings. Both strings must use the same base and mod. Build both hash objects with identical parameters. If you randomize, generate the base once and pass it to both constructors.

7. Hash of empty string. By convention h[0] = 0. Querying query(l, l-1) (empty range) should return 0. Make sure your query function handles this or you never call it with invalid ranges.

8. TLE from unordered_set with bad hash.unordered_set<pair<long long,long long>> doesn't compile by default (no hash for pair). Either combine the two hashes into one long long via v1 * MOD2 + v2, or use a custom hash, or use map/set (with log factor).

Mental Traps

Trap 1: "If my hash passes the samples, it won't collide on the judge."

This is exactly backwards. Problem setters on Codeforces actively add anti-hash tests after seeing common base/mod choices in previous rounds. If you use base=31, mod=1e9+7 -- appearing in thousands of submissions -- your hash is already known. Adversarial inputs can be constructed to force collisions with known parameters. Always randomize the base at runtime and always use double hashing.

  The adversary's workflow on Codeforces:

  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
  │ Read your    │    │ Find two     │    │ Submit hack  │
  │ submission:  │───>│ strings with │───>│ with those   │
  │ base=31      │    │ same hash    │    │ strings as   │
  │ mod=1e9+7    │    │ under (31,   │    │ test input   │
  │              │    │  1e9+7)      │    │              │
  └──────────────┘    └──────────────┘    └──────────────┘
  
  Defense: randomize base at runtime
    mt19937 rng(chrono::steady_clock::now()...);
    BASE = uniform_int_distribution<ll>(256, 1e9)(rng);
  
  Adversary can't read your runtime-chosen base -> can't hack.

Trap 2: "Double hashing is overkill for this problem."

It adds 5 lines of code and doubles a tiny constant factor. The alternative -- getting hacked or failing a hidden test -- wastes an entire submission and potentially your contest ranking. Single hashing is a coin flip on Codeforces; double hashing is insurance that costs nothing.



🐛 When Your Solution is Wrong

  • WA (Wrong Answer):
    • Hash collision! Single hash has ~1/MOD collision probability. Use double hashing (two different bases/mods) for 1012 collision chance.
    • Wrong modular arithmetic (negative remainders in some languages)
    • Off-by-one in substring hash extraction
  • TLE (Time Limit Exceeded):
    • Recomputing hash from scratch for each substring instead of using prefix hash array
    • Using string::substr() which copies -- O(n) per call
  • Anti-hack tip: Randomize the base at runtime: base = rng() % (MOD - 2) + 2.

Common Mistakes

  1. Single hash collisions. Birthday paradox makes collisions likely above ~10^5 hashes. Always use double hashing (two independent base/mod pairs).
  2. Negative values from subtraction. (a - b) % p can be negative in C++. Fix: ((a - b) % p + p) % p.
  3. Wrong power precomputation. pw[0] must be 1 (not base), and powers must be precomputed up to the full string length.
  4. Hash comparison without collision awareness. Hash equality means "probably equal." Always verify with direct comparison in critical applications, or use double hashing.
  5. Anti-hash tests. Some competitive programming judges use anti-hash tests. Double hashing with large primes (e.g., 10^9 + 7 and 10^9 + 9) is the standard defense.
  6. Hardcoded base on adversarial judges. "I hashed every substring correctly but got Wrong Answer on test 37..." You used base = 31 and mod = 1e9+7 -- the exact constants from the cp-algorithms template. The problem setter generated an anti-hash test that forces collisions for these specific values. Switching to a random base chosen at runtime (mt19937) and a second modulus made the solution robust. Never hardcode the base in a problem where adversarial tests are possible.

Debug Drills

Drill 1 -- Off-by-one in substring hash extraction

cpp
// Hash of s[l..r] (0-indexed, inclusive)
long long get_hash(int l, int r) {
    return (h[r] - h[l] * pw[r - l + 1]) % MOD;
}
What's wrong?

It should be h[r+1] - h[l] * pw[r - l + 1] (if h is 1-indexed prefix hash where h[i] = hash of s[0..i-1]). Or if h[r] means hash of first r+1 chars, then the subtraction index is wrong. The classic fix:

cpp
long long get_hash(int l, int r) {
    return ((h[r + 1] - h[l] * pw[r - l + 1]) % MOD + MOD) % MOD;
}

The + MOD) % MOD also prevents negative remainders.

Drill 2 -- Power array overflow

cpp
pw[0] = 1;
for (int i = 1; i <= n; i++)
    pw[i] = pw[i - 1] * BASE;  // no modular reduction!
What's wrong?

Without % MOD, pw[i] overflows even long long for i > 40 or so. Every multiplication in the power array must be taken modulo MOD:

cpp
pw[i] = pw[i - 1] * BASE % MOD;

Drill 3 -- Comparing hashes from different mod systems

cpp
if (get_hash1(l1, r1) == get_hash2(l2, r2))  // hash1 uses MOD1, hash2 uses MOD2
    cout << "match";
What's wrong?

You're comparing a hash computed modulo MOD1 with one computed modulo MOD2. They can be equal by coincidence even for completely different strings. Both hashes must use the same (base, mod) pair, or you must compare (hash1_mod1, hash1_mod2) == (hash2_mod1, hash2_mod2) for double hashing.


Self-Test

Before moving to problems, verify you can do these without looking at the notes:

  • [ ] Implement double polynomial hashing from memory -- two hash arrays with two (base, mod) pairs, including + MOD in subtraction and power precomputation
  • [ ] State the collision probability for single vs. double hash with 106 comparisons and p109, and explain why double hashing is adversary-proof
  • [ ] Write the formula to extract hash of s[l..r]: (h[r+1]h[l]brl+1)modp, and explain why + MOD is needed
  • [ ] Name two situations where hashing is the right tool (binary search on string properties, multiple substring equality checks) and two where exact algorithms are better (guaranteed correctness, single pattern search)
  • [ ] State the minimum safe base for lowercase-only strings (>26) and explain why base = 26 causes non-collision false matches

Practice Problems

CF RatingWhat You Should Be Able to Do
1200Compute a polynomial hash; check if a pattern occurs in a text (single hash).
1500Use prefix hashes for O(1) substring comparison; find all occurrences of a pattern.
1800Double hashing to resist anti-hash tests; rolling hash for sliding-window substring problems.
2100Combine hashing with binary search (e.g., longest common substring); hash on trees or 2-D grids.
#ProblemSourceDifficultyKey Idea
1Substring Search (CSES)CSES1400Rabin-Karp pattern matching
2Password (CF 126B)CF1700Prefix = suffix = interior substring via hashing + Z/KMP
3Good Substrings (CF 271D)CF1500Count distinct substrings with character constraint
4Palindrome Degree (CF 7D)CF1700Recursive palindrome check with hashing
5Yet Another Palindrome Problem (CF 1326D2)CF1800Greedy + palindrome hashing
6Kuro and GCD and XOR and SUM (CF 182D)CF1500String period detection
7Shortest Superstring (CF 25E)CF2200Overlap via hashing + greedy
8String Hashing (CSES)CSES1400Direct application of polynomial hashing
9Palindrome Pairs (CF 17E)CF2400Hashing + combinatorics for palindrome counting
10MUH and Cube Walls (CF 471D)CF1800Difference array + Rabin-Karp matching

Further Reading


Solve With Me -- CF 271D "Good Substrings"

Problem: given a string s and a classification of each letter as "good" or "bad," count the number of distinct substrings of s that contain at most k bad characters.

First thought: enumerate all substrings, count bad chars in each, and collect distinct ones. There are O(n2) substrings, and I can maintain a bad-character count incrementally as I extend the right endpoint. But how to count distinct substrings? Throwing them into a set<string> is O(n2logn) per insertion with string comparisons -- and n can be 1500, so n22.25×106 substrings, each up to length 1500. That's actually... maybe borderline? But string hashing lets me do this cleanly.

Plan: for each starting position i, extend j from i to n1. Maintain a running hash and a bad-char count. If bad count k, insert the hash into an unordered_set. At the end, the set size is the answer.

But single hashing has collision risk. With 106 elements, birthday paradox says I need a modulus well above 1012 to be safe. I'll use double hashing -- two independent (base, mod) pairs, and store pair<long long, long long> in the set.

Wait, unordered_set doesn't hash pairs by default in C++. I need a custom hash or I can pack two 64-bit values into a single 128-bit key... or just use set<pair<long long, long long>> with O(logn) insertion. With 2.25×106 insertions, that's about 2×107 operations. Fine for 2 seconds.

Coded it up, AC. The key insight was that hashing makes "distinct substring" queries trivial -- each substring is just a number (or pair of numbers), and set insertion handles uniqueness.

What to recognize next time: "Count distinct substrings with property X" -> hash each substring, throw hashes into a set. The per-substring property check is the easy part; hashing is what makes distinctness tractable without a suffix structure.


Recap

  • Polynomial hash: h(s) = sum(s[i] * base^i) mod p. Precompute prefix hashes for O(1) substring queries.
  • Always double-hash with two independent (base, mod) pairs to avoid collisions.
  • Use cases: substring equality, distinct substring counting, string matching, longest common substring via binary search + hashing.
  • Limitation: hashing is probabilistic. For hack-proof solutions, consider suffix arrays or KMP.

Built for the climb from Codeforces 1100 to 2100.