Skip to content

Coordinate Compression

Quick Revisit

  • USE WHEN: Values are huge/sparse but only relative order matters—need to use them as array indices
  • INVARIANT: sort + deduplicate + lower_bound maps n values to [0, n) preserving order
  • TIME: O(n log n) to compress | SPACE: O(n)
  • CLASSIC BUG: Forgetting to add query points to the compression set before compressing
  • PRACTICE: Practice Ladder

Map n values from an arbitrarily large or sparse domain into a dense range [0,n) so that index-based structures (BIT, segment tree, counting array) can operate in O(n) space instead of O(max_val).

Difficulty: Intermediate | CF Rating: 1300–1700 | Prerequisites: Sorting and Searching, Hash Map and Unordered Map


Contents


Intuition

Consider these 6 coordinates:

cpp
a[] = { 1000000000, 3, 500, 1000000000, 3, 750 }

You need to use them as array indices—to update a Binary Indexed Tree or a counting array. The largest value is 109. Allocating an array that size means roughly 4 GB for a single int array; the memory limit is 256 MB.

Look at the waste. Between index 4 and index 499—nothing. Between 501 and 749—nothing. Between 751 and 999999999—nothing. Out of 109 slots, only 4 are occupied.

Only the relative order of values matters, not their magnitudes.

There are exactly 4 distinct values: {3,500,750,109}. Map them to consecutive integers 0,1,2,3. Now your BIT needs only 4 slots.

Analogy. A concert venue has seats numbered 1 through 106, but only 4 guests show up. You don't hand them seat 3, seat 500, seat 750, and seat 106. You hand them seat 0, seat 1, seat 2, seat 3—simple consecutive numbers. Every comparison ("is your seat before mine?") still gives the same answer.

Visual Walkthrough

text
  STEP 0 -- Original array
  +------------+---+-----+------------+---+-----+
  | 1000000000 | 3 | 500 | 1000000000 | 3 | 750 |
  +------------+---+-----+------------+---+-----+
       [0]      [1]  [2]      [3]      [4]  [5]

  STEP 1 -- Copy values, sort them
  sorted[]:
  +---+---+-----+-----+------------+------------+
  | 3 | 3 | 500 | 750 | 1000000000 | 1000000000 |
  +---+---+-----+-----+------------+------------+

  STEP 2 -- Remove consecutive duplicates (unique + erase)
  dict[]:
  +---+-----+-----+------------+
  | 3 | 500 | 750 | 1000000000 |
  +---+-----+-----+------------+
   [0]  [1]   [2]      [3]

  STEP 3 -- Map each original value via lower_bound on dict[]
  a[0] = 1000000000  -->  lower_bound finds dict[3]  -->  3
  a[1] = 3           -->  lower_bound finds dict[0]  -->  0
  a[2] = 500         -->  lower_bound finds dict[1]  -->  1
  a[3] = 1000000000  -->  lower_bound finds dict[3]  -->  3
  a[4] = 3           -->  lower_bound finds dict[0]  -->  0
  a[5] = 750         -->  lower_bound finds dict[2]  -->  2

  STEP 4 -- Result: compressed array
  +---+---+---+---+---+---+
  | 3 | 0 | 1 | 3 | 0 | 2 |
  +---+---+---+---+---+---+
   [0] [1] [2] [3] [4] [5]

  BIT / segment tree size needed: 4  (not 1000000001)

The Invariant

text
+------------------------------------------------------------------+
| compressed[i] < compressed[j]  iff  original[i] < original[j]   |
+------------------------------------------------------------------+

Coordinate compression is an order-preserving bijection from the set of distinct values to {0,1,,k1} where k is the number of distinct values. In the walkthrough above, original[1]=3<500=original[2], and indeed compressed[1]=0<1=compressed[2]. Every less-than, equal-to, and greater-than relationship survives the mapping. That is why any algorithm depending only on comparisons—BIT queries, inversion counting, segment tree range sums—produces identical results on compressed data.

What the Code Won't Teach You

The technique is about space, not correctness. Coordinate compression doesn't change what an algorithm does—it enables an algorithm that requires integer indices in a small range to operate on data with large values. The underlying algorithm (segment tree, BIT) is the actual tool; compression is the adapter that plugs oversized values into it.

text
  Without compression:              With compression:

  BIT size = max_val = 10^9         BIT size = n = 200000
  +---------------------------+     +-----------+
  | 99.9999% empty slots      |     | every     |
  | wasting 4 GB of memory    |     | slot used |
  +---------------------------+     +-----------+
  Memory: 4 GB  (MLE)              Memory: 800 KB  (OK)

When you need val - 1 and val + 1 in the compression set. This arises when queries involve strict inequalities. If you query "how many values r" and r is a query endpoint not in the data, including r in the compression set ensures it gets a rank. For "strictly less than r," include r1 as well.

text
  Worked example: why query values need compression

  data = [7, 2, 5, 5, 9, 2]    query: "count values <= 6"

  Compress data only: vals = [2, 5, 7, 9]
                      rank:   0  1  2  3
  Query "count <= 6": rank of 6 = ??? (6 not in vals!)
    lower_bound gives rank 2 (pointing at 7). WRONG.

  Compress data + queries: vals = [2, 5, 6, 7, 9]
                           rank:   0  1  2  3  4
  Query "count <= 6": rank of 6 = 2  -->  BIT prefix [0..2]
    Counts elements with rank 0,1,2 = values 2,5,6.
    Only 2 and 5 exist in data --> answer = 4 (two 2s + two 5s).

The lookup direction matters: value-to-rank vs. rank-to-value.lower_bound gives value-to-rank: "what rank does value x get?" For rank-to-value ("what original value has rank r?"), use vals[r] directly. Some problems need both directions—for instance, "output the k-th smallest distinct value" requires rank-to-value.

With the core mechanics covered, here is how to recognize when compression is the right tool.

When to Reach for This

Trigger phrases in problem statements:

  • "coordinates up to 109" (or 1018) with n2×105
  • "use values as array indices" but the value range far exceeds n
  • "segment tree on values" or "BIT on values"—not on positions
  • "count elements less than x" across an array
  • "offline queries" where all query values are known in advance

Counter-examples (compression not needed):

  • Values are already in [1,n] (e.g., a permutation). Use them directly.
  • You only need to compare values, never index into an array by value. A plain sort suffices.
  • The value range fits in memory (e.g., values in [0,106] with a generous limit). Direct indexing is simpler.

C++ STL Reference

Function / ClassHeaderWhat it doesTime Complexity
sort(first, last)<algorithm>Sorts range in-place (ascending by default)O(nlogn)
sort(first, last, comp)<algorithm>Sorts with custom comparatorO(nlogn)
unique(first, last)<algorithm>Removes consecutive duplicates; returns iterator to new logical endO(n)
vec.erase(first, last)<vector>Erases elements in range [first, last) from vectorO(n)
lower_bound(first, last, val)<algorithm>Iterator to first element val in sorted rangeO(logn)
upper_bound(first, last, val)<algorithm>Iterator to first element > val in sorted rangeO(logn)
distance(first, last)<iterator>Number of increments from first to lastO(1) for random-access
ranges::sort(range)<algorithm>C++20 ranges version of sortO(nlogn)

Key idiom: sortuniqueerase is the standard three-step for deduplicating a vector.

text
  lower_bound vs upper_bound on sorted vals = [2, 5, 5, 7, 9]:

  lower_bound(5):                    upper_bound(5):
  +---+---+---+---+---+             +---+---+---+---+---+
  | 2 | 5 | 5 | 7 | 9 |             | 2 | 5 | 5 | 7 | 9 |
  +---+---+---+---+---+             +---+---+---+---+---+
        ^                                        ^
     first >= 5                           first > 5
     rank = 1                             rank = 3

  For compression (after unique): vals = [2, 5, 7, 9]
  lower_bound(5) -> rank 1      (use for "rank of value")
  upper_bound(5) -> rank 2      (use for "count of values <= x")
cpp
sort(v.begin(), v.end());
v.erase(unique(v.begin(), v.end()), v.end());

Implementation (Contest-Ready)

Version 1: Minimal Contest Template

cpp
#include <bits/stdc++.h>
using namespace std;

int main() {
    int n;
    cin >> n;
    vector<int> a(n);
    for (auto& x : a) cin >> x;

    // Compress
    vector<int> vals(a);
    sort(vals.begin(), vals.end());
    vals.erase(unique(vals.begin(), vals.end()), vals.end());

    // Map each element to its compressed index
    vector<int> c(n);
    for (int i = 0; i < n; i++)
        c[i] = lower_bound(vals.begin(), vals.end(), a[i]) - vals.begin();

    // c[i] is in [0, vals.size())
    // To recover original value: vals[c[i]]

    // Example output
    for (int i = 0; i < n; i++)
        cout << c[i] << " \n"[i == n - 1];

    return 0;
}

Version 2: Explained Version with Reusable Function

cpp
#include <bits/stdc++.h>
using namespace std;

// Compresses values in-place. Returns the sorted unique values (the "dictionary").
// After this call, a[i] contains the compressed index in [0, dict.size()).
// To recover original value: dict[a[i]].
template <typename T>
vector<T> compress(vector<T>& a) {
    // Step 1: Copy all values into a separate vector.
    // We need the originals to stay intact during sorting.
    vector<T> dict(a);

    // Step 2: Sort + deduplicate.
    // After sort, duplicates are adjacent, so unique() collapses them.
    // erase() shrinks the vector to the new logical size.
    sort(dict.begin(), dict.end());
    dict.erase(unique(dict.begin(), dict.end()), dict.end());
    // dict now holds each distinct value exactly once, in sorted order.
    // dict.size() == number of distinct values == compressed range size.

    // Step 3: Replace each original value with its index in dict[].
    // lower_bound on a sorted array is O(log n), so this loop is O(n log n).
    for (auto& x : a) {
        // lower_bound returns iterator to the position of x in dict.
        // Subtracting dict.begin() converts to a 0-based index.
        x = lower_bound(dict.begin(), dict.end(), x) - dict.begin();
    }

    return dict;
}

// Compress multiple arrays that share the same value space.
// Example: data values + query values both need the same mapping.
template <typename T>
vector<T> compress_multi(vector<vector<T>*> arrays) {
    vector<T> dict;
    for (auto* arr : arrays)
        for (auto& x : *arr)
            dict.push_back(x);

    sort(dict.begin(), dict.end());
    dict.erase(unique(dict.begin(), dict.end()), dict.end());

    for (auto* arr : arrays)
        for (auto& x : *arr)
            x = lower_bound(dict.begin(), dict.end(), x) - dict.begin();

    return dict;
}

int main() {
    int n;
    cin >> n;
    vector<int> a(n);
    for (auto& x : a) cin >> x;

    // Single-array compression
    vector<int> dict = compress(a);

    cout << "Compressed: ";
    for (int i = 0; i < n; i++)
        cout << a[i] << " \n"[i == n - 1];

    cout << "Dictionary (" << dict.size() << " distinct values): ";
    for (int i = 0; i < (int)dict.size(); i++)
        cout << dict[i] << " \n"[i == (int)dict.size() - 1];

    return 0;
}

Operations Reference

OperationTimeSpace
Build dictionary (sort + unique + erase)O(nlogn)O(n)
Compress one element (lower_bound)O(logn)O(1)
Compress all n elementsO(nlogn)O(n)
Decompress one element (dict[c[i]])O(1)O(1)
Compress k arrays of total size NO(NlogN)O(N)

Here n = number of distinct values in the dictionary.


Problem Patterns & Tricks

Inversion Count via BIT

Compress values, then sweep left-to-right. For each element, query the BIT for how many already-inserted values are greater. The canonical application.

text
  Inversion count worked example: a = [4, 1, 3, 2]

  Step 1 -- Compress:  vals = [1,2,3,4]  ranks: 1->0, 2->1, 3->2, 4->3
    compressed a = [3, 0, 2, 1]

  Step 2 -- Sweep left-to-right with BIT:
    BIT = [0, 0, 0, 0]     (size = 4)

    i=0: a[0]=3  query(3)=0 inv from above=0  update(3,+1)
         BIT = [0, 0, 0, 1]
    i=1: a[1]=0  query(3)-query(0)=1  inv=1   update(0,+1)
         BIT = [1, 0, 0, 1]     inversions: (4,1)
    i=2: a[2]=2  query(3)-query(2)=1  inv=1   update(2,+1)
         BIT = [1, 0, 1, 1]     inversions: (4,3)
    i=3: a[3]=1  query(3)-query(1)=2  inv=2   update(1,+1)
         BIT = [1, 1, 1, 1]     inversions: (4,2),(3,2)

  Total inversions = 0 + 1 + 1 + 2 = 4
  Verify: (4,1),(4,3),(4,2),(3,2) = 4  <-- correct
cpp
// After compression, a[i] in [0, m)
long long inv = 0;
for (int i = 0; i < n; i++) {
    inv += query(m - 1) - query(a[i]); // count of inserted vals > a[i]
    update(a[i], 1);
}

Problems: CF 61E, CF 459D

Range Update / Point Query on Sparse Values

Values up to 109, but n2×105. Compress first, then build a BIT or segment tree of size n.

text
  Multi-array compression -- data + queries share one dictionary:

  data   = [100, 3, 50]     queries = [50, 200]
                  |                         |
                  v                         v
  combined = [3, 50, 100, 200]   (sort + unique)
  rank:       0   1    2    3

  compressed data    = [2, 0, 1]
  compressed queries = [1, 3]

  BIT size = 4   (not 200!)

Problems: CF 1311E, CF 1208D

"Count of Elements Less Than X" Offline

Sort queries and data together, compress all values into one dictionary, then sweep with a BIT or merge sort.

Problems: CF 1042D, CF 1324D

Coordinate Compression for 2D Sweep Line

Compress x-coordinates (or y-coordinates) before sweeping. Essential when coordinates reach 109 but there are only O(n) distinct values.

Problems: CF 1401E

Segment Tree with Lazy Propagation on Compressed Coordinates

When you need range operations on values rather than positions, compress the value space first, then build a segment tree of size O(n).

Problems: CF 915E

Longest Increasing Subsequence (LIS) via BIT

Compress values, then compute LIS in O(nlogn) using a BIT where BIT[v] is the length of the longest increasing subsequence ending at compressed value v.

cpp
// After compression: a[i] in [0, m)
vector<int> dp(m, 0);
// BIT query max in [0, a[i]-1], then update a[i] with result+1

Problems: CF 340D


Contest Cheat Sheet

text
+--------------------------------------------------------------+
|              COORDINATE COMPRESSION CHEAT SHEET               |
+--------------------------------------------------------------+
| WHEN TO USE:                                                  |
|   - Values up to 10^9, need BIT/segtree/counting array       |
|   - n <= 2*10^5 but value range is huge                       |
|   - "How many elements less/greater than X" type queries      |
+--------------------------------------------------------------+
| TEMPLATE (copy-paste):                                        |
|                                                               |
|   vector<int> vals(a);                                        |
|   sort(vals.begin(), vals.end());                             |
|   vals.erase(unique(vals.begin(), vals.end()), vals.end());   |
|   for (auto& x : a)                                          |
|       x = lower_bound(vals.begin(), vals.end(), x)            |
|           - vals.begin();                                     |
|                                                               |
+--------------------------------------------------------------+
| COMPLEXITY:                                                   |
|   Time:  O(n log n)   Space: O(n)                            |
+--------------------------------------------------------------+
| REMEMBER:                                                     |
|   - Compress ALL arrays sharing the value space               |
|   - Decompress: vals[compressed_index]                        |
|   - lower_bound for >=, upper_bound for >                     |
|   - BIT/segtree size = vals.size(), NOT original max          |
+--------------------------------------------------------------+

Gotchas & Debugging

  1. Forgetting to merge value spaces. If your data has values in a[] and your queries reference values too, compress both into the same dictionary. Mismatched dictionaries produce silently wrong indices.

  2. Off-by-one with lower_bound vs. upper_bound.

    • lower_bound → first position value → use for "rank of this value."
    • upper_bound → first position > value → use for "count of values x" (result = upper_bound - begin).
    • Mixing them up produces wrong answers with no compiler warning.
  3. Sorting the original instead of a copy. You need the original order for the actual algorithm. Always copy before sorting.

  4. Using 1-indexed BIT with 0-indexed compression. If your BIT is 1-indexed, shift every compressed value up: c[i] = lb_index + 1.

  5. Integer overflow in the original values. Compressed indices are small, so arithmetic on them is safe. But the original values can overflow int during preprocessing—use long long when values exceed 2×109.

  6. Duplicate handling. lower_bound maps all duplicates to the same index, which is usually what you want (rank-based equality). If you need unique indices per occurrence, you need tie-breaking by position (e.g., stable sort on (value, original_index) pairs).

  7. Empty input. If n=0, the dictionary is empty and lower_bound on an empty range is technically fine, but a BIT or segment tree of size 0 will segfault. Guard against it explicitly.

Debugging tip: Print the dictionary and compressed array; verify dict[c[i]] == original_a[i] for every i. This single check catches the vast majority of compression bugs.

Mental Traps

Trap: "Coordinate compression is just sorting."

text
  Compression is THREE steps, not one:

  sort          unique + erase      lower_bound
  +--------+    +--------+          +--------+
  | 3 3 5  | -> | 3 5 7  |  a[i] ->| rank   |
  | 5 7 7  |    +--------+         +--------+
  +--------+    deduplicate         rank-lookup

  Skip deduplication --> duplicate values get
  different ranks --> wrong answers.

Trap: "I'll compress and figure out the indexing later."

text
  WRONG approach:                     RIGHT approach:
  +---------------------------+       +---------------------------+
  | 1. Compress data values   |       | 1. Decide: 0-based or     |
  | 2. Build BIT              |       |    1-based indexing?       |
  | 3. Wait, BIT is 1-indexed |       | 2. Need val-1 or val+1    |
  |    but compression is     |       |    in compression set?     |
  |    0-indexed... add 1?    |       | 3. Collect ALL values      |
  |    Which values? Some?    |       |    (data + queries + adj)  |
  | 4. Debug for 2 hours      |       | 4. Compress + build        |
  +---------------------------+       +---------------------------+
  Design the indexing scheme BEFORE you compress.

Design your indexing scheme—0-based vs. 1-based, which sentinel values to include—before you call sort. A two-minute plan saves two hours of debugging.

Trap: "Compression and normalization are the same thing."

text
  Coordinate compression:          Normalization:
  +-------------------------+      +-------------------------+
  | values -> integer ranks |      | values -> [0.0, 1.0]   |
  | {3, 500, 750} -> {0,1,2}|      | {3, 500, 750} ->       |
  |                         |      |   {0.0, 0.665, 1.0}    |
  | Preserves ORDER only    |      | Preserves RATIOS        |
  | For integer indexing     |      | For real-valued tasks   |
  +-------------------------+      +-------------------------+

Self-Test

  • [ ] Write the full coordinate compression pipeline (sort + unique + erase + rank lookup) from memory, with no bugs in the unique + erase idiom.
  • [ ] Explain why both data values and query values must be included in the compression set, with a concrete example where omitting query values produces wrong results.
  • [ ] State when you need to include val - 1 or val + 1 in the compression set.
  • [ ] Given a compressed rank r, state how to recover the original value.
  • [ ] Explain the purpose of coordinate compression in one sentence.
  • [ ] Describe the difference between 0-indexed compression and 1-indexed BIT usage, and how to bridge the gap.
  • [ ] Given a = [100, 5, 100, 50], produce the compressed array by hand, showing each step.

Practice Problems

#ProblemSourceDifficultyKey Idea
1B - Sort the ArrayCF 451B1300Sorting + checking, warm-up for compression thinking
2D - Pair of TopicsCF 1324D1400Compress differences, count pairs with BIT/binary search
3D - Number of Segments with Big SumCF 1042D1600Prefix sums + compression + BIT for counting
4E - Construct the Binary TreeCF 1311E1600Compress + BIT for inversions during construction
5E - Inversions After ShuffleCF 749E1500Expected inversions, BIT on compressed values
6D - Serega and FunCF 459D1800Count inversions between two halves, compression required
7D - Restore PermutationCF 1208D1700Reconstruct permutation using BIT on compressed sums
8E - Physical Education LessonsCF 915E2000Interval operations on compressed coordinate space
9Nested Ranges CheckCSES1600Compress endpoints, check containment with BIT
10Nested Ranges CountCSES1700Count containing/contained ranges after compression
11Salary QueriesCSES1700Coordinate compress salaries for BIT point updates

Further Reading

The core insight: you don't need the actual values—only their relative order. Mapping n sparse values to [0,n) lets array-indexed structures operate on domains that would otherwise demand gigabytes of memory. The fingerprint to burn into memory: values up to 109 + need an array-index structure (BIT/segment tree) → coordinate compression.


Rating Progression

  • CF 1200: Compress values before counting with a frequency array; map large coordinates to dense indices for simple lookups.
  • CF 1500: Compress before BIT/segment tree for inversion counting or range queries on values.
  • CF 1800: Compress both endpoints of intervals for sweep-line algorithms; handle events on a compressed timeline.
  • CF 2100: Multi-dimensional compression for 2D segment trees; compress on-the-fly with dynamic segment trees as an alternative.

Before You Code Checklist

  • [ ] Collected all values that need compression (including query boundaries, not just array elements).
  • [ ] Sorted and deduplicated with sort + unique + erase (or used a set).
  • [ ] Used lower_bound to map original values to compressed indices—and verified 0-indexed vs. 1-indexed for BIT compatibility.
  • [ ] Confirmed the compressed index is used everywhere the original value was used as an index.
  • [ ] Checked whether the problem requires mapping back from compressed index to original value (keep the sorted unique array accessible).

What Would You Do If...?

  1. You need to compress coordinates but also need the gap sizes between consecutive compressed values—how would you preserve distance information?
  2. New values arrive online (not all known upfront)—can you still use static compression, or do you need a different approach?
  3. Two different arrays need compression—should you compress them independently or merge all values into one shared mapping?

The Mistake That Teaches

A contestant compressed array values for a BIT but forgot to compress the query boundaries too. The BIT was indexed [0,n), but queries used raw values like 5×108 as BIT indices—causing out-of-bounds access. The code crashed on the judge but "worked" locally on small tests where raw values happened to be small. The rule: compress everything that touches the indexed structure—data values and query parameters.

Debug This

Bug 1:

cpp
// Compress values in array a[]
vector<int> sorted_a = a;
sort(sorted_a.begin(), sorted_a.end());
sorted_a.erase(unique(sorted_a.begin(), sorted_a.end()), sorted_a.end());
for (int i = 0; i < n; i++)
    a[i] = lower_bound(sorted_a.begin(), sorted_a.end(), a[i]) - sorted_a.begin();
// Now use a[i] as 1-indexed BIT position
bit.update(a[i], 1);  // Bug here
Answer

lower_bound returns a 0-based index, but BITs typically require 1-based indexing. Use a[i] + 1 or a[i] = ... - sorted_a.begin() + 1 during compression.

Bug 2:

cpp
// Compress, then count inversions
vector<int> vals(a.begin(), a.end());
sort(vals.begin(), vals.end());
// Forgot to deduplicate!
for (int i = 0; i < n; i++)
    a[i] = lower_bound(vals.begin(), vals.end(), a[i]) - vals.begin();
Answer

Without unique + erase, duplicate values get different compressed indices (e.g., [3,3,5] compressed as [0,1,2] instead of [0,0,1]). This breaks any logic that depends on equal values mapping to the same index.

Bug 3:

cpp
// Compress coordinates for sweep line
vector<int> xs;
for (auto& seg : segments) {
    xs.push_back(seg.l);
    xs.push_back(seg.r);
}
sort(xs.begin(), xs.end());
xs.erase(unique(xs.begin(), xs.end()), xs.end());
// Query: how many segments cover point q?
int cq = lower_bound(xs.begin(), xs.end(), q) - xs.begin();
// Bug: q might not be in xs
Answer

If q was not added to xs before compression, lower_bound returns the position where q would be inserted, not a valid compressed index. All query points must be added to the compression set alongside segment endpoints.

Historical Origin

Coordinate compression is a discretization technique rooted in the computational geometry of the 1970s–80s, where sweep-line algorithms by Bentley and Ottmann needed to handle real-valued coordinates on integer-indexed structures. It became a competitive programming staple through USACO and IOI problems in the 2000s. A useful mnemonic to carry into contests: "Shrink the map, keep the rank—values change, order doesn't."

What to Solve Next

  • Greedy—the next core technique; coordinate compression often appears as a preprocessing step in greedy interval problems.
  • Practice Ladder—drill compression as a subroutine in BIT/segment-tree problems.

Built for the climb from Codeforces 1100 to 2100.