Hash Map and Unordered Map

Fast key-value lookups in $O (1)$ average. When the problem is really "remember what I've seen" or "jump straight to the bucket that owns this key," this is the default tool for contests.

Difficulty: Beginner | Prerequisites: Arrays and Strings | CF Rating Range: 800–2100

Intuition

You have six user IDs and need to find which ones appear more than once:

text

a = [5, 3, 8, 3, 12, 5]

The naive approach: compare each element with everything that comes after it.

text

i=0 (5):  compare with 3, 8, 3, 12, 5   -->  5 comparisons  (match at j=5)
i=1 (3):  compare with 8, 3, 12, 5       -->  4 comparisons  (match at j=3)
i=2 (8):  compare with 3, 12, 5          -->  3 comparisons  (no match)
i=3 (3):  compare with 12, 5             -->  2 comparisons  (already counted)
i=4 (12): compare with 5                 -->  1 comparison   (no match)
                                              ---------------
                                 Total:       15 comparisons

For $n = 6$ , that is $\frac{n (n - 1)}{2} = 15$ comparisons. For $n = 10^{5}$ , that balloons to roughly $5 \times 10^{9}$ —far too slow for a 2-second time limit. And this is just duplicate detection; frequency counting is equally painful with the same approach.

The PO Box Idea

If you can compute where a key lives in constant time, you never need to search for it.

Think of a post office with numbered PO boxes. When a letter arrives for box 137, the clerk doesn't inspect every box on the wall—she goes straight to 137. A hash map works the same way: a hash function $h (k)$ turns a key $k$ into a bucket index, and lookup becomes a direct jump instead of a scan.

The analogy has one crack: PO boxes have one owner each, but hash buckets can hold multiple keys. That is a collision, and it means the average $O (1)$ lookup can degrade to $O (chain length)$ . In the worst case, when everything lands in one bucket, every operation becomes $O (n)$ .

Trace: `[5, 3, 8, 3, 12, 5]` into 7 Buckets

Count the frequency of each element in a table with $m = 7$ buckets and hash $h (k) = k mod 7$ .

Step 1: Insert a[0] = 5. $h (5) = 5$ . Bucket 5 empty. Store 5 : 1.

text

  Bucket   Contents
  +-----+-----------+
  | [5] | 5 : 1     |  <-- inserted
  +-----+-----------+

Step 2: Insert a[1] = 3. $h (3) = 3$ . Bucket 3 empty. Store 3 : 1.

Step 3: Insert a[2] = 8. $h (8) = 1$ . Bucket 1 empty. Store 8 : 1.

text

  Bucket   Contents
  +-----+-----------+
  | [1] | 8 : 1     |  <-- inserted
  | [3] | 3 : 1     |
  | [5] | 5 : 1     |
  +-----+-----------+

Step 4: Insert a[3] = 3. $h (3) = 3 mod 7 = 3$ . Bucket 3 already has key 3 with count 1. Key matches—increment count to 2. Duplicate found!

text

  Bucket   Contents
  +-----+-----------+
  | [0] | (empty)   |
  | [1] | 8 : 1     |
  | [2] | (empty)   |
  | [3] | 3 : 2     |  <-- count updated (duplicate!)
  | [4] | (empty)   |
  | [5] | 5 : 1     |
  | [6] | (empty)   |
  +-----+-----------+
  ops so far: 4

Step 5: Insert a[4] = 12. $h (12) = 12 mod 7 = 5$ . Bucket 5 has key 5—but $12 \neq 5$ , so this is a collision. Chain key 12 after key 5 in the same bucket. Store 12 -> 1.

text

  Bucket   Contents
  +-----+---------------------+
  | [1] | 8 : 1               |
  | [3] | 3 : 2               |
  | [5] | 5 : 1  -->  12 : 1  |  <-- collision chain
  +-----+---------------------+

Step 6: Insert a[5] = 5. $h (5) = 5 mod 7 = 5$ . Bucket 5 has chain 5:1 -> 12:1. First entry matches key 5—increment count to 2. Duplicate found!

text

  Bucket   Contents
  +-----+---------------------+
  | [1] | 8 : 1               |
  | [3] | 3 : 2               |
  | [5] | 5 : 2  -->  12 : 1  |
  +-----+---------------------+

Final tally: 7 hash-table operations for 6 elements, versus 15 brute-force comparisons. With a good hash function, each insertion and lookup averages $O (1)$ , giving $O (n)$ total work. The gap only widens: brute force is $O (n^{2})$ ; the hash table stays $O (n)$ .

The Invariant

text

+-------------------------------------------------------------------+
| Every key k in the table lives in bucket h(k) % m.                |
| To find, insert, or update k, we only examine bucket h(k) % m     |
| and its collision chain — never the entire table.                 |
+-------------------------------------------------------------------+

This invariant holds because every insertion computes the hash before deciding where to store the key. No key is ever placed in the wrong bucket.

Look at Step 4: when we insert the second 3, we compute $h (3) = 3$ and go straight to bucket 3. The invariant guarantees the first 3 is already there, so we find it in one probe and increment its count. Without this invariant—if keys could drift to arbitrary buckets—we'd be back to scanning everything.

The invariant degrades when all keys hash to the same bucket. That single bucket's chain grows to length $n$ , and every operation becomes $O (n)$ . This is exactly what anti-hash attacks exploit: adversaries craft inputs so $h (k) mod m$ is identical for every key. The defense is a custom hash with random salt so the attacker cannot predict bucket assignments.

`operator[]` Creates Entries Silently

operator[] is a trap: it doesn't just read, it writes.

When you write if (m[key] == 0), you are not just checking—you are inserting key with value 0 if it did not exist. After that line, m.size() has grown by one. In a loop that tests millions of nonexistent keys, this silently fills the map with garbage entries and causes MLE or wrong answers.

Rule: use m.count(key), m.find(key) != m.end(), or C++20's m.contains(key) for existence checks. Never use m[key] unless you intend to create the entry.

Anti-Hash Attacks

The default std::hash<int> on many compilers is close to the identity function (it returns the key itself or a trivial transform). The STL typically uses a power-of-2 bucket count $m = 2^{k}$ , so the bucket index is hash(key) & (m - 1)—just the low-order bits. An attacker who knows $m$ can craft $n$ keys that all share the same low bits, forcing every key into one bucket:

text

  Normal distribution (good hash):       Anti-hash attack (identity hash):

  Bucket  Chain                           Bucket  Chain
  [0]  -> A                               [0]  -> K1 -> K2 -> ... -> Kn
  [1]  -> B -> C                          [1]  -> (empty)
  [2]  -> (empty)                         [2]  -> (empty)
  [3]  -> D                               [3]  -> (empty)
  [4]  -> E -> F                          ...
  [5]  -> (empty)                         [m-1]-> (empty)
  ...
  Avg lookup: O(1)                        Every lookup: O(n)  --> TLE!

  How it happens:
    default hash(k) ≈ k                (identity-like)
    bucket_count m  = 2^k              (power of 2)
    bucket index    = hash(k) & (m-1)  (low bits)
    Attacker picks 0, m, 2m, 3m, ...   (all share low bits)
    n = 10^5 keys × 10^5 chain = 10^10 ops --> TLE

This is not theoretical—it happens on Codeforces regularly. The defense is always the same: use a custom hash with a random salt so bucket assignments are unpredictable (see the custom hash implementation below).

Chaining vs. Open Addressing

The two standard collision strategies, side by side:

text

  CHAINING                                OPEN ADDRESSING
  Buckets     Linked lists                Buckets (inline storage)
  +-----+     +---+   +---+              +-----+-----+-----+-----+-----+
  | [0] | --> | A | ->| D |              |  A  |  B  | gap |  D  |  C  |
  +-----+     +---+   +---+              +-----+-----+-----+-----+-----+
  | [1] | --> | B |                       Probe sequence: h(k), h(k)+1, ...
  +-----+     +---+
  | [2] |     (empty)                     + Better cache locality
  +-----+                                 + No pointer chasing
  | [3] | --> | C |                       - Harder to delete
  +-----+     +---+                       - Degrades at high load factor

  + Simple, handles high load             See: gp_hash_table from <ext/pb_ds>
  - Pointer chasing = cache misses
  - High memory overhead per node

"unordered_map is always faster" is a common trap. Prefer map—a balanced BST with guaranteed $O (\log n)$ —when:

You need sorted iteration or lower_bound / upper_bound queries.
You need range queries ("how many keys in $[L, R]$ ?").
$n$ is small ( $\leq 10^{5}$ )—map can have better cache locality for small node counts and no hash overhead, so it can beat unordered_map in practice.
Contest safety—map has $O (\log n)$ worst case. No anti-hash attack can degrade it. When in doubt during a contest, map is the safer default.

text

  +---------------------------------------------------+
  |  DECISION: map vs unordered_map                   |
  +---------------------------------------------------+
  |  Need sorted order / lower_bound?                 |
  |    YES --> use map                                |
  |    NO  |                                          |
  |        v                                          |
  |  Keys are ints in [0, C), C <= 10^7?              |
  |    YES --> use vector<int> freq(C, 0)             |
  |    NO  |                                          |
  |        v                                          |
  |  n > 10^5 and need max speed?                     |
  |    YES --> unordered_map + custom_hash + reserve  |
  |    NO  |                                          |
  |        v                                          |
  |  Default: map (safe, predictable O(log n))        |
  +---------------------------------------------------+

STL Reference

Container	Header	Description	Avg Lookup	Worst Lookup
`unordered_map<K,V>`	`<unordered_map>`	Hash table, key-value pairs	$O (1)$	$O (n)$
`unordered_set<K>`	`<unordered_set>`	Hash table, keys only (no values)	$O (1)$	$O (n)$
`map<K,V>`	`<map>`	Balanced BST, key-value pairs	$O (\log n)$	$O (\log n)$
`set<K>`	`<set>`	Balanced BST, keys only	$O (\log n)$	$O (\log n)$

One pitfall to watch: after m.reserve(1 << 20) you have about 1M buckets. Calling m.clear() visits every bucket even if the map holds only 10 entries. In a tight loop that repeatedly fills and clears a map, this hidden cost can cause TLE. Alternatives: (a) swap with a fresh map (m = {};), or (b) track inserted keys in a side vector and erase only those.

Operation	Syntax	Notes
Insert / update	`m[key] = val;`	Creates entry if key absent (default-constructs value)
Insert without overwrite	`m.insert({key, val});`	No-op if key exists
Lookup (unsafe)	`m[key]`	Inserts default if missing! Don't use for existence checks
Lookup (safe)	`m.count(key)` or `m.find(key) != m.end()`	Does not insert
Lookup (C++20)	`m.contains(key)`	Cleanest existence check
Delete	`m.erase(key);`	No-op if key absent
Size	`m.size()`
Iterate	`for (auto& [k,v] : m)`	Structured bindings (C++17)
Reserve buckets	`m.reserve(n);`	Pre-allocate to avoid rehashing
Clear	`m.clear();`	$O (bucket_count)$ , not $O (size)$ —see Gotchas

Implementation

Frequency Counting

cpp

#include <bits/stdc++.h>
using namespace std;

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n;
    cin >> n;
    unordered_map<int, int> freq;
    freq.reserve(n);

    for (int i = 0; i < n; i++) {
        int x;
        cin >> x;
        freq[x]++;
    }

    // Find the most frequent element
    int best_key = 0, best_cnt = 0;
    for (auto& [key, cnt] : freq) {
        if (cnt > best_cnt) {
            best_cnt = cnt;
            best_key = key;
        }
    }

    cout << best_key << " appears " << best_cnt << " times\n";
}

Custom Hash — Anti-Hack Safe

Use this in every CF contest where you reach for unordered_map or unordered_set. FIXED_RANDOM is computed once per run from the system clock, so the attacker cannot predict your bucket assignments.

cpp

#include <bits/stdc++.h>
using namespace std;

struct custom_hash {
    static uint64_t splitmix64(uint64_t x) {
        x += 0x9e3779b97f4a7c15;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9;
        x = (x ^ (x >> 27)) * 0x94d049bb133111eb;
        return x ^ (x >> 31);
    }

    size_t operator()(uint64_t x) const {
        static const uint64_t FIXED_RANDOM =
            chrono::steady_clock::now().time_since_epoch().count();
        return splitmix64(x + FIXED_RANDOM);
    }
};

// For pair<int,int> keys (e.g., coordinate mapping)
struct pair_hash {
    size_t operator()(pair<int,int> const& p) const {
        custom_hash h;
        auto a = h(p.first);
        auto b = h(p.second);
        return a ^ (b << 32 | b >> 32);
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n;
    cin >> n;

    unordered_map<long long, int, custom_hash> freq;
    freq.reserve(1 << 20);  // ~1M buckets, avoids rehashing on most contest sizes

    for (int i = 0; i < n; i++) {
        long long x;
        cin >> x;
        freq[x]++;
    }

    // Coordinate map — keys are pair<int,int>
    unordered_map<pair<int,int>, int, pair_hash> coord;
    coord.reserve(1 << 20);
    coord[{3, 5}] = 42;

    for (auto& [key, val] : freq)
        cout << key << " : " << val << "\n";
}

Operations Reference

Operation	`unordered_map` (avg)	`unordered_map` (worst)	`map`
Insert	$O (1)$	$O (n)$	$O (\log n)$
Lookup	$O (1)$	$O (n)$	$O (\log n)$
Delete	$O (1)$	$O (n)$	$O (\log n)$
`lower_bound`	N/A	N/A	$O (\log n)$
Iterate all	$O (n)$	$O (n)$	$O (n)$ sorted

Space Complexity

Container	Space	Notes
`unordered_map`	$O (n)$	~50-80 bytes per entry (high constant)
`map`	$O (n)$	~40-50 bytes per entry
`unordered_set`	$O (n)$	Less than map variant (no value stored)

Practical note: unordered_map uses 3–5× more memory than a plain array. If keys are small integers in $[0, C)$ , prefer a plain vector<int> freq(C, 0)—it is faster and uses less memory.

Problem Patterns

Frequency counting. "How many times does each element appear?" / "Most frequent." The base pattern.

cpp

unordered_map<int, int, custom_hash> freq;
for (int x : a) freq[x]++;

CF 4C—Registration System (1300). Frequency counting with string keys. CF 1676F—Longest Strike (1300). Frequency map + sweep.

Two-sum / complement lookup. For each element, check whether its complement already exists in the map. Check seen[target - x] before inserting x to avoid pairing an element with itself (unless the problem allows $i = j$ ).

cpp

unordered_map<int, int, custom_hash> seen;
long long ans = 0;
for (int x : a) {
    ans += seen[target - x];
    seen[x]++;
}

CF 1520D—Same Differences (1200). Transform to $a_{i} - i$ , count pairs. CSES 1640—Sum of Two Values.

Always clarify whether the problem wants ordered or unordered pairs, and whether $i = j$ is allowed. The hash map doesn't care—you have to.

Coordinate / state mapping. Map composite keys—pairs, tuples, strings—to values. Common in grid problems and state-space search.

cpp

unordered_map<pair<int,int>, int, pair_hash> last_seen;
// or use map<pair<int,int>, int> for simplicity

CF 1296C—Yet Another Walking Robot (1500). Coordinate pair as key, cycle detection.

First / last index tracking. Store the first (or most recent) occurrence index of each value.

cpp

unordered_map<int, int, custom_hash> first_idx;
for (int i = 0; i < n; i++) {
    if (!first_idx.contains(a[i]))
        first_idx[a[i]] = i;
}

CF 782A—Andryusha and Socks (800).

Canonical signature as key. Normalize each element to a canonical form, then group by that form.

cpp

// Group numbers by their "complement" prime factorization
map<vector<int>, int> sig_count;

CF 1225D—Power Products (1700). Prime signature as map key, complement lookup.

String set deduplication. Store substrings or hashes in a set to count distinct matches.

cpp

unordered_set<string> seen;
for (int i = 0; i < n; i++) {
    string sub = s.substr(i, len);
    if (is_good(sub)) seen.insert(sub);
}

CF 271D—Good Substrings (1500).

When Not to Use

Need sorted iteration or lower_bound / upper_bound. Hash maps have no ordering. Use map or set.
Keys are small integers in $[0, C)$ , $C \leq 10^{7}$ . A plain vector<int> freq(C, 0) is faster and uses far less memory.
Range queries ("how many keys fall in $[L, R]$ ?"). Hash tables cannot answer range queries. Use a balanced BST (set / map), Fenwick tree, or segment tree.
Contest with aggressive hacking and you are using default hash. You will get hacked. Switch to map or add custom_hash.

Gotchas

m[key] silently inserts. The #1 beginner bug—covered in detail above. Use count, find, or contains for existence checks.

Common Mistakes & Debugging

Forgetting .reserve() causes repeated rehashing. Without reserve, the container rebuilds from scratch each time it crosses the load-factor threshold—each rehash is $O (n)$ and can happen many times during a single insertion loop. m.reserve(1 << 20) avoids this for most contest sizes.

m.clear() is $O (bucket_count)$ , not $O (size)$ . After m.reserve(1 << 20), clearing visits $\sim$1M buckets even if only 5 keys were inserted. In a $T$ -test-case loop that clears a pre-reserved map each round, total cost is $O (T \cdot bucket_count)$ —easily TLE. Either skip reserve when clearing per test, declare a fresh unordered_map per case, or track inserted keys in a side vector and erase only those.

Iteration order is non-deterministic. unordered_map order depends on hash values and bucket count—do not rely on insertion order. If you need ordered output, collect into a vector and sort.

erase(key) removes the entire key, not one occurrence. This trips up anyone who treats a frequency map like a multiset:

cpp

unordered_map<int,int> freq;
for (int x : a) freq[x]++;
freq.erase(5);                        // wipes key 5 entirely
if (--freq[5] == 0) freq.erase(5);    // decrement then cleanup — correct

Erasing during iteration invalidates the iterator. Use the erase-return-iterator idiom:

cpp

for (auto it = m.begin(); it != m.end(); ) {
    if (should_remove(it->second))
        it = m.erase(it);
    else
        ++it;
}

No default hash for pair, tuple, or custom structs. You must provide a hash functor (see pair_hash in Implementation) or use map instead. A quick alternative for pair<int,int>: encode as a single long long with (uint64_t)a << 32 | (uint32_t)b.

Mental Traps

"unordered_map is always faster than map." For small $n$ or pointer-heavy workloads, map can win. A red-black tree node sits in one allocation; a hash table chains through bucket pointers, linked-list nodes, and potential rehashes. On $n = 10^{4}$ integer insertions you may see map win by 10–30% due to fewer cache misses and zero hashing overhead. The crossover where unordered_map reliably pulls ahead is typically around $n = 10^{5}$ – $10^{6}$ . Default to map in contests unless you've profiled a bottleneck and added custom_hash + reserve.

"O(1) average means safe." $O (1)$ average hides the $O (n)$ worst case. On Codeforces with hacking enabled, adversaries craft inputs that force every key into one bucket—turning your hash map into a linked list. The mechanism is detailed in Anti-Hash Attacks above. Always use custom_hash with a random salt on CF, or just use map for guaranteed $O (\log n)$ .

"clear() is cheap." m.clear() is $O (bucket_count)$ , not $O (size)$ . After m.reserve(1 << 20), clearing visits ~1M buckets even if only 5 keys were inserted. In a loop over $T$ test cases each calling clear() on a pre-reserved map, total cost is $O (T \cdot bucket_count)$ —easily TLE. Either skip reserve if you clear per test case, or erase keys manually from a tracked list.

Self-Test

Before moving on, verify you can answer these without looking back:

[ ] What happens when you write m[key] and key is not in the map? Why is this dangerous for existence checks?
[ ] Why does std::unordered_map with default hash get TLE on Codeforces? Describe the attack mechanism in one sentence.
[ ] Write the custom_hash struct from memory (or at least describe what splitmix64 + random salt achieves).
[ ] When should you prefer map over unordered_map? Name at least 3 situations.
[ ] What is the time complexity of m.clear() after m.reserve(1 << 20) with only 10 elements inserted? Why?
[ ] unordered_map has no default hash for pair<int,int>. What are your two options?
[ ] In the two-sum pattern, why do we check seen[target - x] before inserting x?

Practice Problems

#	Problem	Rating	Key Technique
1	782A—Andryusha and Socks	800	Set membership tracking
2	1520D—Same Differences	1200	Transform $a_{i} - i$ , count pairs via map
3	4C—Registration System	1300	Frequency counting with string keys
4	1676F—Longest Strike	1300	Frequency map + sweep
5	1296C—Yet Another Walking Robot	1500	Coordinate pair as key, cycle detection
6	271D—Good Substrings	1500	String set deduplication
7	1225D—Power Products	1700	Prime signature as map key
8	CSES 1640—Sum of Two Values	—	Complement lookup
9	CSES 1641—Sum of Three Values	—	Fix one, hash the other two
10	CSES 1642—Sum of Four Values	—	Pairwise sum hash map
11	CSES 1660—Subarray Sums I	—	Prefix sum + hash map

Hash Map and Unordered Map ​

Intuition ​

The PO Box Idea ​

Trace: [5, 3, 8, 3, 12, 5] into 7 Buckets ​

The Invariant ​

operator[] Creates Entries Silently ​

Anti-Hash Attacks ​

Chaining vs. Open Addressing ​

STL Reference ​

Implementation ​

Frequency Counting ​

Custom Hash — Anti-Hack Safe ​

Operations Reference ​

Space Complexity ​

Problem Patterns ​

When Not to Use ​

Gotchas ​

Common Mistakes & Debugging ​

Mental Traps ​

Self-Test ​

Practice Problems ​

Further Reading ​

Hash Map and Unordered Map

Intuition

The PO Box Idea

Trace: `[5, 3, 8, 3, 12, 5]` into 7 Buckets

The Invariant

`operator[]` Creates Entries Silently

Anti-Hash Attacks

Chaining vs. Open Addressing

STL Reference

Implementation

Frequency Counting

Custom Hash — Anti-Hack Safe

Operations Reference

Space Complexity

Problem Patterns

When Not to Use

Gotchas

Common Mistakes & Debugging

Mental Traps

Self-Test

Practice Problems

Further Reading