Appearance
String Hashing
AL-34 | Map strings to integers via polynomial hashing -- compare substrings in
Difficulty: [Intermediate]Prerequisites: Arrays and Strings, Math FundamentalsCF Rating Range: 1400 -- 1800
Contest Frequency: ** Regular -- common in string comparison problems
Quick Revisit
- USE WHEN: fast substring comparison, counting distinct substrings, or string matching
- INVARIANT: hash(s[l..r]) computed in O(1) from prefix hashes and precomputed powers
- TIME: O(n) preprocess, O(1) per query | SPACE: O(n)
- CLASSIC BUG: single hash causes collisions -- always use double hashing (two mod values)
- PRACTICE: Strings Practice Ladder
Contents
- Intuition
- C++ STL Reference
- Implementation (Contest-Ready)
- Operations Reference
- Problem Patterns & Tricks
- Contest Cheat Sheet
- Gotchas & Debugging
- Self-Test
- Practice Problems
- Further Reading
Intuition
Suppose you want to check whether a pattern
The naive approach: slide
s: [a b c a b c d a b c a b d a b c a b c a] n = 20
p: [a b c a b] m = 5
Position 0: compare s[0..4] vs p[0..4] -> 5 comparisons
Position 1: compare s[1..5] vs p[0..4] -> up to 5 comparisons
...
Position 15: compare s[15..19] vs p[0..4] -> up to 5 comparisons
---------------------------------------------------------------
Total positions: n - m + 1 = 16
Worst-case comparisons: 16 * 5 = 80 = O(n * m)For
Convert each substring to a single number (its hash) so that comparing two substrings takes
Analogy: Imagine you have a phone book with millions of names. Comparing two names letter-by-letter is slow. But if every name had a unique phone number, you could compare two numbers in a single step. A polynomial hash is that phone number -- it compresses an arbitrary-length string into a fixed-size integer. Two different strings might share a number (collision), but with a good hash this is astronomically unlikely.
Visual walkthrough
We hash the string "abcab" using base
Step 1 -- Build prefix hashes left to right.
Each prefix hash absorbs one more character:
i: 0 1 2 3 4
s[i]: a(1) b(2) c(3) a(1) b(2)
h[0] = 0
h[1] = h[0]*31 + 1 = 0 + 1 = 1
h[2] = h[1]*31 + 2 = 31 + 2 = 33
h[3] = h[2]*31 + 3 = 1023 + 3 = 1026
h[4] = h[3]*31 + 1 = 31806 + 1 = 31807
h[5] = h[4]*31 + 2 = 986017 + 2 = 986019Step 2 -- Extract hash of a substring in
Want hash of "cab". The prefix hash "abcab" and "ab". Subtract the contribution of "ab" shifted into the high digits:
hash(s[2..4]) = h[5] - h[2] * b^3
= 986019 - 33 * 29791
= 986019 - 983103
= 2916 (all mod some prime p)Verify directly:
Step 3 -- Slide the window: rolling from
hash(s[0..4]) = h[5] - h[0] * b^5 = 986019 - 0 = 986019
hash(s[1..5]) = h[6] - h[1] * b^5Each new window hash is one subtraction and one lookup --
Step 4 -- Compare pattern hash to each window hash.
pattern "abcab" hash = 986019
Window i=0: hash(s[0..4]) = 986019 -> MATCH (verify chars to confirm)
Window i=1: hash(s[1..5]) = ... -> compare in O(1)
Window i=2: hash(s[2..6]) = ... -> compare in O(1)
...
Total: O(n) hash comparisons instead of O(n * m) char comparisonsThe invariant
hash(
) = ( ) mod Every substring of
maps to a number in . Two equal substrings always produce the same hash. Two different substrings may collide (same hash), but with a random base and large prime , the collision probability per pair is .
Use double hashing (two independent
Birthday paradox: collision probability vs. number of hashes
Single hash (mod ~= 10⁹):
Hashes: 100 1,000 10,000 30,000 100,000
P(col): ~0% ~0.05% ~5% ~50% ~99.9%
─ ─ ░░ ████ ████████
Double hash (mod ~= 10¹⁸):
Hashes: 100 1,000 10⁶ 10⁹
P(col): ~0% ~0% ~0% ~0.0000005%
─ ─ ─ ─
With n = 10⁵ substrings and single hash:
Pairs ~= 5 x 10⁹ -> EXPECT ~5 collisions -> WA
With double hash:
Pairs ~= 5 x 10⁹ -> EXPECT ~5 x 10⁻⁹ collisions -> SAFEFingerprints, not proofs
It is worth being precise about what a hash equality means.
- Mathematical equality of two substrings is a property of the strings themselves: every character at every offset matches. KMP, Z, suffix arrays, and direct character comparison all establish this.
- Hash equality says only that two strings produce the same fingerprint under your chosen
(base, mod). Equal strings always have equal hashes; the reverse is "almost always" true under double hashing with random parameters.
Two practical consequences:
For most contest problems, double hashing is reliable enough that treating "hash equal" as "string equal" is fine. The expected number of collisions over a contest's worth of comparisons is effectively zero.
When the cost of one wrong answer is total failure -- e.g., you build a graph of "equal" substrings and run DFS, or you commit to an output based on a single equality check -- consider verifying surviving candidates by direct character comparison after the hash filter:
cppif (hs.query(l1, r1) == hs.query(l2, r2) && equal(s.begin() + l1, s.begin() + r1 + 1, s.begin() + l2)) { // confirmed equal }The verification is
per surviving pair, but you only pay it on hash matches, so total cost stays close to the hash-only version.
This is the core editorial takeaway: hashing is randomized contest reliability, not mathematical equality. Use it freely, but design your algorithm so a single false positive cannot silently corrupt the rest of the work.
What the Code Won't Teach You
When exact algorithms are strictly better. Hashing is probabilistic -- a collision gives a wrong answer. KMP and suffix arrays are deterministic. For problems where you string together multiple equality checks and act on them irreversibly (e.g., building a graph of equal substrings, then running DFS), a single hash collision cascades into completely wrong output. Exact algorithms don't have this failure mode. Know when the risk is acceptable.
The birthday paradox is your enemy. With a single mod
Hashing on unordered_map is double jeopardy. If you store hash values in an unordered_map, the map itself uses another hash function on your keys. A collision in the map's internal hash slows lookups to map (slower but safe) or a custom splitmix64 hash for the container.
Defending against anti-hash tests. Codeforces allows "hacking" with adversarial inputs, and there are well-known techniques for constructing strings that collide under common hash parameters. Your defenses: (1) randomize the base and mod at runtime so attackers can't precompute collisions, (2) use double hashing with two independent (mod, base) pairs, (3) avoid the popular mods like
The mental model: hashing is fingerprinting, not identification. Think of a hash as a fingerprint -- if two fingerprints differ, the strings are definitely different; if they match, the strings are probably the same. This one-sided guarantee means hashing excels at disproving equality (filter out non-matches fast) but can't prove it. Design your algorithms around this asymmetry. Use hashing to narrow candidates, then verify the survivors with exact comparisons if correctness demands it.
Where hashing truly shines over heavier machinery. Suffix arrays and automata are powerful but rigid -- they process one fixed string. Hashing is the most flexible string tool you have: compare substrings of different strings, hash 2D grids or trees, combine with binary search to find longest common prefixes, or roll hashes across a sliding window. Anytime a problem involves comparing pieces from multiple sources or the structure doesn't fit neatly into suffix-array territory, hashing is often the simplest correct approach.
When to reach for this
Trigger phrases in problem statements:
- "compare substrings for equality"
- "pattern matching" / "find all occurrences"
- "count distinct substrings"
- "check if a substring is a palindrome"
- "longest common substring" (combine with binary search)
- "string period" / "string has a repeating pattern"
Counter-examples -- when hashing is NOT the best tool:
- You need a guaranteed exact match with no false positives -> use KMP or Z-function (deterministic).
- Single-pattern search where simplicity matters -> KMP may be simpler and equally fast.
- You need the full suffix array or LCP array -> build a suffix array directly.
- The problem involves edits/updates to the string -> hashing requires
rebuild (consider segment tree + hashing for point updates).
C++ STL Reference
String hashing is not provided by the STL -- you implement it yourself. These STL tools help:
| Function / Class | Header | What it does | Time Complexity |
|---|---|---|---|
string::operator[] | <string> | Access character at index (use for hash value) | |
string::substr(pos, len) | <string> | Extract substring (avoid in hot loops -- | |
string::size() | <string> | Length of string | |
mt19937 rng(chrono::steady_clock::now().time_since_epoch().count()) | <random>, <chrono> | Randomized base to defeat anti-hash tests | |
uniform_int_distribution<int> | <random> | Pick random base in range | |
__int128 | built-in (GCC) | Avoid overflow in modular multiplication |
Key detail: Don't use std::hash<string> for CP -- it's not portable, not invertible over substrings, and trivially hackable on Codeforces.
Implementation (Contest-Ready)
Version 1: Single Hash -- Minimal Template
cpp
#include <bits/stdc++.h>
using namespace std;
struct SingleHash {
static const long long MOD = 1e9 + 7;
static const long long BASE = 131; // pick randomly in contest
int n;
vector<long long> h, pw;
SingleHash(const string& s) : n(s.size()), h(n + 1), pw(n + 1) {
pw[0] = 1;
for (int i = 0; i < n; i++) {
h[i + 1] = (h[i] * BASE + s[i]) % MOD;
pw[i + 1] = pw[i] * BASE % MOD;
}
}
// hash of s[l..r] (0-indexed, inclusive)
long long query(int l, int r) {
return (h[r + 1] - h[l] * pw[r - l + 1] % MOD + MOD) % MOD;
}
};
int main() {
string s;
cin >> s;
SingleHash sh(s);
int q;
cin >> q;
while (q--) {
int l1, r1, l2, r2;
cin >> l1 >> r1 >> l2 >> r2;
cout << (sh.query(l1, r1) == sh.query(l2, r2) ? "Yes" : "No") << "\n";
}
}Version 2: Double Hash with Substring Query -- Explained
cpp
#include <bits/stdc++.h>
using namespace std;
struct DoubleHash {
// Two large primes far from powers of 2.
// Randomize BASE1, BASE2 in contest to resist hacking.
static constexpr long long MOD1 = 1e9 + 7, MOD2 = 1e9 + 9;
long long BASE1, BASE2;
int n;
vector<long long> h1, h2, pw1, pw2;
DoubleHash(const string& s, long long b1 = 131, long long b2 = 137)
: BASE1(b1), BASE2(b2), n(s.size()),
h1(n + 1), h2(n + 1), pw1(n + 1), pw2(n + 1) {
pw1[0] = pw2[0] = 1;
for (int i = 0; i < n; i++) {
// Build prefix hashes: h[i+1] = h[i] * base + s[i]
h1[i + 1] = (h1[i] * BASE1 + s[i]) % MOD1;
h2[i + 1] = (h2[i] * BASE2 + s[i]) % MOD2;
// Precompute powers of base
pw1[i + 1] = pw1[i] * BASE1 % MOD1;
pw2[i + 1] = pw2[i] * BASE2 % MOD2;
}
}
// Returns (hash1, hash2) for substring s[l..r] (0-indexed, inclusive).
// Formula: h[r+1] - h[l] * base^(r-l+1), taken mod.
// The +MOD before final %MOD handles C++ negative remainder.
pair<long long, long long> query(int l, int r) {
int len = r - l + 1;
long long v1 = (h1[r + 1] - h1[l] * pw1[len] % MOD1 + MOD1) % MOD1;
long long v2 = (h2[r + 1] - h2[l] * pw2[len] % MOD2 + MOD2) % MOD2;
return {v1, v2};
}
// Compare two substrings for equality in O(1).
bool equal(int l1, int r1, int l2, int r2) {
return query(l1, r1) == query(l2, r2);
}
};
// Randomize bases at startup to defeat anti-hash tests.
mt19937 rng(chrono::steady_clock::now().time_since_epoch().count());
long long rand_base(long long lo = 256, long long hi = 1e9) {
return uniform_int_distribution<long long>(lo, hi)(rng);
}
int main() {
ios::sync_with_stdio(false);
cin.tie(nullptr);
string s, t;
cin >> s >> t;
long long b1 = rand_base(), b2 = rand_base();
DoubleHash hs(s, b1, b2), ht(t, b1, b2);
// Example: find all occurrences of t in s (Rabin-Karp)
int n = s.size(), m = t.size();
auto target = ht.query(0, m - 1);
vector<int> matches;
for (int i = 0; i + m - 1 < n; i++) {
if (hs.query(i, i + m - 1) == target)
matches.push_back(i);
}
cout << matches.size() << "\n";
for (int pos : matches) cout << pos + 1 << " ";
cout << "\n";
}Operations Reference
All operations assume the prefix hash array is already built. The
| Operation | Time | Space |
|---|---|---|
| Build prefix hashes (length | ||
| Query hash of substring | -- | |
| Compare two substrings | -- | |
| Rabin-Karp: find pattern | ||
| Hash + binary search (e.g., longest common prefix) | ||
| Hash all substrings of length | ||
| Build hash for two strings | ||
| Rebuild hash after single-char update | -- |
Problem Patterns & Tricks
Pattern 1: Rabin-Karp String Matching
Find all occurrences of pattern
cpp
auto target = ht.query(0, m - 1);
for (int i = 0; i + m <= n; i++)
if (hs.query(i, i + m - 1) == target)
// match at position iPattern 2: Longest Common Substring via Binary Search + Hashing
Binary search on length
cpp
auto check = [&](int k) -> bool {
unordered_set<long long> seen;
for (int i = 0; i + k <= n; i++)
seen.insert(combine(hs.query(i, i + k - 1)));
for (int i = 0; i + k <= m; i++)
if (seen.count(combine(ht.query(i, i + k - 1))))
return true;
return false;
};
// binary search on k in [0, min(n, m)]Total:
Problems: CF 580E
Pattern 3: Palindrome Check
A substring
cpp
// forward hash of s[l..r]
auto fwd = hs.query(l, r);
// reverse hash: position in reversed string is (n-1-r)..(n-1-l)
auto rev = hr.query(n - 1 - r, n - 1 - l);
bool is_palindrome = (fwd == rev);Pattern 4: Count Distinct Substrings
Hash every substring of every length, insert into a set. The set size is the answer.
cpp
set<pair<long long,long long>> distinct;
for (int len = 1; len <= n; len++)
for (int i = 0; i + len <= n; i++)
distinct.insert(hs.query(i, i + len - 1));
cout << distinct.size();Problems: CF 271D, SPOJ DISUBSTR
Pattern 5: Longest Common Prefix (LCP) with Binary Search
Binary search on the length of the common prefix of
cpp
int lcp(int i, int j) {
int lo = 0, hi = min(n - i, n - j);
while (lo < hi) {
int mid = lo + (hi - lo + 1) / 2;
if (hs.query(i, i + mid - 1) == hs.query(j, j + mid - 1))
lo = mid;
else
hi = mid - 1;
}
return lo;
}Problems: CF 126B
Binary search on LCP length using hash comparison:
s = "a b c a b d a b c a b e"
0 1 2 3 4 5 6 7 8 9 ...
LCP of suffix at i=0 vs suffix at i=6:
lo=0 hi=6
├─────────────────────────────┤
mid=3
hash(s[0..2]) vs hash(s[6..8])
"abc" == "abc" -> match! lo=3
mid=4
hash(s[0..3]) vs hash(s[6..9])
"abca" == "abca" -> match! lo=4
mid=5
hash(s[0..4]) vs hash(s[6..10])
"abcab" == "abcab" -> match! lo=5
mid=5 (converged)
-> LCP = 5
Each step: O(1) hash comparison. Total: O(log n).Pattern 6: Hashing + DP (String Periods, Repetitions)
Use hashing to check if a string has period
cpp
for (int p = 1; p <= n; p++)
if (n % p == 0 && hs.query(0, n - p - 1) == hs.query(p, n - 1))
// s has period pContest Cheat Sheet
+-------------------------------------------------------+
| STRING HASHING CHEAT SHEET |
+-------------------------------------------------------+
| WHEN TO USE |
| - Substring equality queries |
| - Pattern matching (Rabin-Karp) |
| - "Count distinct substrings" up to n ~ 3000 |
| - Palindrome checking with reverse hash |
| - Binary search on string properties (LCP, LCS) |
+-------------------------------------------------------+
| BUILD |
| pw[0]=1; for i in 0..n-1: |
| h[i+1] = (h[i]*B + s[i]) % M |
| pw[i+1] = pw[i]*B % M |
+-------------------------------------------------------+
| QUERY s[l..r] (0-indexed, inclusive) |
| (h[r+1] - h[l]*pw[r-l+1] % M + M) % M |
+-------------------------------------------------------+
| SAFE PARAMS |
| Double hash: MOD = 1e9+7, 1e9+9 |
| BASE: random in [256, 1e9] |
| Randomize at runtime to beat hacks |
+-------------------------------------------------------+
| COMPLEXITY Build: O(n) Query: O(1) |
+-------------------------------------------------------+Gotchas & Debugging
⚠️ Single hashing is NOT safe. With one hash and mod ~10^9, birthday paradox gives ~50% collision probability at ~30,000 strings. On Codeforces, adversarial tests WILL break single hashing.
Always use double hashing (two independent bases and mods):
cppconst long long MOD1 = 1e9 + 7, MOD2 = 1e9 + 9; const long long BASE1 = 131, BASE2 = 137; // Store pair {hash1, hash2} -- compare bothCollision probability with double hashing: ~1/(MOD1 * MOD2) ~= 10^{-18} per pair.
Also: When computing hash differences for substrings, add MOD before taking %:
(h[r] - h[l-1] * pw[r-l+1] % MOD + MOD) % MOD-- the subtraction can go negative.
1. Negative modular arithmetic. C++ % can return negative values when the dividend is negative. The expression h[r+1] - h[l] * pw[len] % MOD can go negative. Always add +MOD before the final %MOD:
cpp
(h[r+1] - h[l] * pw[len] % MOD + MOD) % MOD
// ^^^^2. Operator precedence trap.h[l] * pw[len] % MOD is parsed as (h[l] * pw[len]) % MOD because * and % have the same precedence and associate left-to-right. This is correct here. But if you write h[r+1] - h[l] * pw[len] + MOD) % MOD without the inner % MOD, the multiplication overflows long long. Always reduce after each multiply.
3. Collision probability -- don't use single hash in contests with hacking. A single mod __int128).
4. Choosing bases.
- Don't use
BASE = 26orBASE = 31-- they're the first thing anti-hash tests target. - Randomize at runtime:
BASE = uniform_int_distribution<long long>(256, 1e9)(rng). - Make sure
BASE > 256so all ASCII values are distinct modulo the base.
5. Off-by-one in prefix hash indexing. The prefix hash array is 1-indexed: h[0] = 0, h[i] = hash of s[0..i-1]. The substring s[l..r] uses h[r+1] and h[l]. Mixing 0-indexed and 1-indexed is the #1 source of bugs. Write a query(l, r) function and always go through it.
6. Comparing substrings from different strings. Both strings must use the same base and mod. Build both hash objects with identical parameters. If you randomize, generate the base once and pass it to both constructors.
7. Hash of empty string. By convention h[0] = 0. Querying query(l, l-1) (empty range) should return 0. Make sure your query function handles this or you never call it with invalid ranges.
8. TLE from unordered_set with bad hash.unordered_set<pair<long long,long long>> doesn't compile by default (no hash for pair). Either combine the two hashes into one long long via v1 * MOD2 + v2, or use a custom hash, or use map/set (with
Mental Traps
Trap 1: "If my hash passes the samples, it won't collide on the judge."
This is exactly backwards. Problem setters on Codeforces actively add anti-hash tests after seeing common base/mod choices in previous rounds. If you use base=31, mod=1e9+7 -- appearing in thousands of submissions -- your hash is already known. Adversarial inputs can be constructed to force collisions with known parameters. Always randomize the base at runtime and always use double hashing.
The adversary's workflow on Codeforces:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Read your │ │ Find two │ │ Submit hack │
│ submission: │───>│ strings with │───>│ with those │
│ base=31 │ │ same hash │ │ strings as │
│ mod=1e9+7 │ │ under (31, │ │ test input │
│ │ │ 1e9+7) │ │ │
└──────────────┘ └──────────────┘ └──────────────┘
Defense: randomize base at runtime
mt19937 rng(chrono::steady_clock::now()...);
BASE = uniform_int_distribution<ll>(256, 1e9)(rng);
Adversary can't read your runtime-chosen base -> can't hack.Trap 2: "Double hashing is overkill for this problem."
It adds 5 lines of code and doubles a tiny constant factor. The alternative -- getting hacked or failing a hidden test -- wastes an entire submission and potentially your contest ranking. Single hashing is a coin flip on Codeforces; double hashing is insurance that costs nothing.
🐛 When Your Solution is Wrong
- WA (Wrong Answer):
- Hash collision! Single hash has ~
collision probability. Use double hashing (two different bases/mods) for collision chance. - Wrong modular arithmetic (negative remainders in some languages)
- Off-by-one in substring hash extraction
- Hash collision! Single hash has ~
- TLE (Time Limit Exceeded):
- Recomputing hash from scratch for each substring instead of using prefix hash array
- Using
string::substr()which copies --per call
- Anti-hack tip: Randomize the base at runtime:
base = rng() % (MOD - 2) + 2.
Common Mistakes
- Single hash collisions. Birthday paradox makes collisions likely above ~10^5 hashes. Always use double hashing (two independent base/mod pairs).
- Negative values from subtraction.
(a - b) % pcan be negative in C++. Fix:((a - b) % p + p) % p. - Wrong power precomputation.
pw[0]must be 1 (not base), and powers must be precomputed up to the full string length. - Hash comparison without collision awareness. Hash equality means "probably equal." Always verify with direct comparison in critical applications, or use double hashing.
- Anti-hash tests. Some competitive programming judges use anti-hash tests. Double hashing with large primes (e.g., 10^9 + 7 and 10^9 + 9) is the standard defense.
- Hardcoded base on adversarial judges. "I hashed every substring correctly but got Wrong Answer on test 37..." You used
base = 31andmod = 1e9+7-- the exact constants from the cp-algorithms template. The problem setter generated an anti-hash test that forces collisions for these specific values. Switching to a random base chosen at runtime (mt19937) and a second modulus made the solution robust. Never hardcode the base in a problem where adversarial tests are possible.
Debug Drills
Drill 1 -- Off-by-one in substring hash extraction
cpp
// Hash of s[l..r] (0-indexed, inclusive)
long long get_hash(int l, int r) {
return (h[r] - h[l] * pw[r - l + 1]) % MOD;
}What's wrong?
It should be h[r+1] - h[l] * pw[r - l + 1] (if h is 1-indexed prefix hash where h[i] = hash of s[0..i-1]). Or if h[r] means hash of first r+1 chars, then the subtraction index is wrong. The classic fix:
cpp
long long get_hash(int l, int r) {
return ((h[r + 1] - h[l] * pw[r - l + 1]) % MOD + MOD) % MOD;
}The + MOD) % MOD also prevents negative remainders.
Drill 2 -- Power array overflow
cpp
pw[0] = 1;
for (int i = 1; i <= n; i++)
pw[i] = pw[i - 1] * BASE; // no modular reduction!What's wrong?
Without % MOD, pw[i] overflows even long long for i > 40 or so. Every multiplication in the power array must be taken modulo MOD:
cpp
pw[i] = pw[i - 1] * BASE % MOD;Drill 3 -- Comparing hashes from different mod systems
cpp
if (get_hash1(l1, r1) == get_hash2(l2, r2)) // hash1 uses MOD1, hash2 uses MOD2
cout << "match";What's wrong?
You're comparing a hash computed modulo MOD1 with one computed modulo MOD2. They can be equal by coincidence even for completely different strings. Both hashes must use the same (base, mod) pair, or you must compare (hash1_mod1, hash1_mod2) == (hash2_mod1, hash2_mod2) for double hashing.
Self-Test
Before moving to problems, verify you can do these without looking at the notes:
- [ ] Implement double polynomial hashing from memory -- two hash arrays with two (base, mod) pairs, including
+ MODin subtraction and power precomputation - [ ] State the collision probability for single vs. double hash with
comparisons and , and explain why double hashing is adversary-proof - [ ] Write the formula to extract hash of
: , and explain why + MODis needed - [ ] Name two situations where hashing is the right tool (binary search on string properties, multiple substring equality checks) and two where exact algorithms are better (guaranteed correctness, single pattern search)
- [ ] State the minimum safe base for lowercase-only strings (
) and explain why base = 26causes non-collision false matches
Practice Problems
| CF Rating | What You Should Be Able to Do |
|---|---|
| 1200 | Compute a polynomial hash; check if a pattern occurs in a text (single hash). |
| 1500 | Use prefix hashes for |
| 1800 | Double hashing to resist anti-hash tests; rolling hash for sliding-window substring problems. |
| 2100 | Combine hashing with binary search (e.g., longest common substring); hash on trees or 2-D grids. |
| # | Problem | Source | Difficulty | Key Idea |
|---|---|---|---|---|
| 1 | Substring Search (CSES) | CSES | 1400 | Rabin-Karp pattern matching |
| 2 | Password (CF 126B) | CF | 1700 | Prefix = suffix = interior substring via hashing + Z/KMP |
| 3 | Good Substrings (CF 271D) | CF | 1500 | Count distinct substrings with character constraint |
| 4 | Palindrome Degree (CF 7D) | CF | 1700 | Recursive palindrome check with hashing |
| 5 | Yet Another Palindrome Problem (CF 1326D2) | CF | 1800 | Greedy + palindrome hashing |
| 6 | Kuro and GCD and XOR and SUM (CF 182D) | CF | 1500 | String period detection |
| 7 | Shortest Superstring (CF 25E) | CF | 2200 | Overlap via hashing + greedy |
| 8 | String Hashing (CSES) | CSES | 1400 | Direct application of polynomial hashing |
| 9 | Palindrome Pairs (CF 17E) | CF | 2400 | Hashing + combinatorics for palindrome counting |
| 10 | MUH and Cube Walls (CF 471D) | CF | 1800 | Difference array + Rabin-Karp matching |
Further Reading
- cp-algorithms: String Hashing -- Canonical reference with proofs and birthday paradox analysis.
- cp-algorithms: Rabin-Karp -- Pattern matching application in detail.
- Codeforces Blog: Anti-hash tests -- Why single hash gets hacked and how to defend.
- Codeforces Blog: Hashing tutorial -- Community tutorial with problem links.
- See:
02-kmp-and-z-function.md-- Deterministic alternatives when hashing is risky. - See:
04-suffix-array.md-- For problems where you need all substrings or LCP arrays at scale.
Solve With Me -- CF 271D "Good Substrings"
Problem: given a string
First thought: enumerate all substrings, count bad chars in each, and collect distinct ones. There are set<string> is
Plan: for each starting position unordered_set. At the end, the set size is the answer.
But single hashing has collision risk. With (base, mod) pairs, and store pair<long long, long long> in the set.
Wait, unordered_set doesn't hash pairs by default in C++. I need a custom hash or I can pack two 64-bit values into a single 128-bit key... or just use set<pair<long long, long long>> with
Coded it up, AC. The key insight was that hashing makes "distinct substring" queries trivial -- each substring is just a number (or pair of numbers), and set insertion handles uniqueness.
What to recognize next time: "Count distinct substrings with property X" -> hash each substring, throw hashes into a set. The per-substring property check is the easy part; hashing is what makes distinctness tractable without a suffix structure.
Recap
- Polynomial hash:
h(s) = sum(s[i] * base^i) mod p. Precompute prefix hashes for O(1) substring queries. - Always double-hash with two independent (base, mod) pairs to avoid collisions.
- Use cases: substring equality, distinct substring counting, string matching, longest common substring via binary search + hashing.
- Limitation: hashing is probabilistic. For hack-proof solutions, consider suffix arrays or KMP.
Cross-Links
- KMP and Z-Function -- deterministic O(n) pattern matching alternative
- Math Fundamentals -- modular arithmetic underlying hash computation
- Strings Practice Ladder -- graded problem set