Sqrt Decomposition

Quick Revisit
USE WHEN: range queries + point updates and segment tree feels like overkill or is hard to apply
INVARIANT: array split into √n blocks; full blocks use precomputed answers, partial blocks are brute-forced
TIME: O(√n) per query and per update | SPACE: O(n)
CLASSIC BUG: last block is smaller than √n — forgetting to handle the partial tail block
PRACTICE: 08-ladder-mixed

AL-39 | Split an array into $O (\sqrt{n})$ blocks to answer range queries and point updates in $O (\sqrt{n})$ per operation -- the brute-force data structure that's often good enough.

Why this lives in the offline-techniques chapter. Sqrt decomposition itself is not an offline method -- it supports online point updates and range queries, and could equally well sit in Chapter 2. It is grouped here because (a) it is the canonical block-based fallback when segment trees feel like overkill, and (b) the block-decomposition mindset is the prerequisite for Mo's algorithm. Treat the chapter placement as a roadmap waypoint, not a claim that sqrt decomposition is fundamentally offline.

Difficulty: [Intermediate]Prerequisites: Arrays and Strings, Prefix SumsCF Rating Range: 1600 -- 2000

2. Intuition
3. C++ STL Reference
4. Implementation (Contest-Ready)
5. Operations Reference
6. Problem Patterns & Tricks
7. Contest Cheat Sheet
8. Gotchas & Debugging
9. Self-Test
10. Practice Problems
11. Further Reading

Intuition

2.1 The Pain

Consider an array of 16 elements. You need to support two operations: range sum queries and point updates.

  Index:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
         +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  a[]:   | 3 | 1 | 4 | 1 | 5 | 9 | 2 | 6 | 5 | 3 | 5 | 8 | 9 | 7 | 9 | 3 |
         +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

The naive approach scans every element in the range: $O (n)$ per query. With $n = 10^{5}$ elements and $q = 10^{5}$ queries, that is $10^{10}$ operations -- far too slow for a 2-second time limit. Prefix sums handle static queries in $O (1)$ , but a single point update forces an $O (n)$ rebuild. We need something in between.

2.2 The Key Insight

Split the array into blocks of size $B \approx \sqrt{n}$ -- each query touches at most 2 partial blocks ( $O (\sqrt{n})$ each) and $O (\sqrt{n})$ full blocks with precomputed answers, giving $O (\sqrt{n})$ per query.

$B$ is a performance parameter, not a sacred constant. The default $B = ⌊ \sqrt{n} ⌋$ minimizes $B + n / B$ , which is the right balance when partial-block scans and full-block lookups have equal per-element cost. When the costs are asymmetric -- e.g., far more queries than updates, or partial-block work is $O (B^{2})$ instead of $O (B)$ -- the optimal $B$ shifts. It is worth tuning when you are close to the time limit; see the "block size choice" gotcha below.

Analogy. Imagine organizing books on a bookshelf. You group every $\sqrt{n}$ books into one shelf and label each shelf with the total number of books it holds. To count books in a range, you read the labels on all full shelves instantly, then count individual books on the two partial shelves at the ends by hand. The total work is proportional to $\sqrt{n}$ , not $n$ .

  Sqrt block structure (n=16, block_size=4):

  Array:  [ 3  1  4  1 | 5  9  2  6 | 5  3  5  8 | 9  7  9  3 ]
  Blocks:  +-block 0--+  +-block 1--+  +-block 2--+  +-block 3--+
  Sums:   [    9      |     22     |     21     |     28     ]

  Query [2..13]:
  partial |##|  full block  |  full block  | partial |##|
  [ 3  1 |4  1 | 5  9  2  6 | 5  3  5  8 | 9  7| 9  3 ]
         ^                                      ^
         L=2                                    R=13

  = brute(2..3) + block_sum[1] + block_sum[2] + brute(12..13)

For our 16-element array, $B = ⌊ \sqrt{16} ⌋ = 4$ , giving 4 blocks:

  Index:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
         +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  a[]:   | 3 | 1 | 4 | 1 | 5 | 9 | 2 | 6 | 5 | 3 | 5 | 8 | 9 | 7 | 9 | 3 |
         +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
         |   block 0     |   block 1     |   block 2     |   block 3     |
         |   sum = 9     |   sum = 22    |   sum = 21    |   sum = 28    |
         +---------------+---------------+---------------+---------------+

2.3 Visual Walkthrough

Step 1 -- Build block sums ( $O (n)$ )

Scan the array once, accumulating each element into its block:

  block[0] = a[0]+a[1]+a[2]+a[3]   = 3+1+4+1   =  9
  block[1] = a[4]+a[5]+a[6]+a[7]   = 5+9+2+6   = 22
  block[2] = a[8]+a[9]+a[10]+a[11] = 5+3+5+8   = 21
  block[3] = a[12]+a[13]+a[14]+a[15] = 9+7+9+3 = 28

Step 2 -- Range query: sum( $l = 2$ , $r = 13$ )

  Index:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
         +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  a[]:   | 3 | 1 |*4 |*1 |*5 |*9 |*2 |*6 |*5 |*3 |*5 |*8 |*9 |*7 | 9 | 3 |
         +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
         | block 0       | block 1       | block 2       | block 3       |
                  ^                                              ^
                  l=2                                            r=13

  (a) Partial left block (block 0, indices 2..3):
      Scan a[2] + a[3] = 4 + 1 = 5              cost: 2 elements

  (b) Full middle blocks (blocks 1 and 2):
      block[1] + block[2] = 22 + 21 = 43        cost: 2 block lookups

  (c) Partial right block (block 3, indices 12..13):
      Scan a[12] + a[13] = 9 + 7 = 16           cost: 2 elements

  Total = 5 + 43 + 16 = 64
  Verify: 4+1+5+9+2+6+5+3+5+8+9+7 = 64  <-- correct

Worst-case cost: $O (B) + O (n / B) + O (B)$ . With $B = \sqrt{n}$ this is $O (\sqrt{n})$ .

  Query complexity analysis:

  +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  |  |##|  full blocks (at most sqrt(n))  |##|  |  |
  +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
       ^                                    ^
    partial                              partial
    (<= B elts)                         (<= B elts)

  Total = O(B) + O(n/B) + O(B) = O(B + n/B)
  Minimized when B = sqrt(n):  O(sqrt(n))

Step 3 -- Point update: set $a [5] = 1$

  Old a[5] = 9.  New a[5] = 1.  Difference = 1 - 9 = -8.
  block_id = 5 / 4 = 1.
  block[1] += -8  =>  22 - 8 = 14.
  a[5] = 1.

  After update:
         +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  a[]:   | 3 | 1 | 4 | 1 | 5 | 1 | 2 | 6 | 5 | 3 | 5 | 8 | 9 | 7 | 9 | 3 |
         +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
         | block 0       | block 1       | block 2       | block 3       |
         | sum = 9       | sum = 14      | sum = 21      | sum = 28      |
                                ^
                              updated

  Cost: O(1)

Step 4 -- Query after update: sum( $l = 4$ , $r = 7$ )

  l and r are both in block 1.  Same-block case: just scan.

  a[4] + a[5] + a[6] + a[7] = 5 + 1 + 2 + 6 = 14
  (Matches the updated block[1] = 14.)

  Cost: O(B) = O(sqrt(n))

2.4 The Invariant

  +--------------------------------------------------------------+
  | INVARIANT: block[b] always stores the aggregate (sum / min / |
  | max) of the elements a[b*B .. min((b+1)*B-1, n-1)].         |
  |                                                              |
  | - Point update maintains this in O(1) for sum (O(B) for     |
  |   min/max, since you must rescan the block).                 |
  | - Range query decomposes [l,r] into at most O(sqrt(n))       |
  |   blocks plus two partial tails of O(sqrt(n)) elements each. |
  +--------------------------------------------------------------+

Every operation preserves the invariant. Build sets it. Point update restores it by adjusting a single block. Queries rely on it to skip full blocks.

2.5 When to Reach for This

Trigger phrases in problem statements:

"range queries + updates" with $n, q \leq 2 \times 10^{5}$ -- sqrt decomp is the simplest solution.
" $O (\sqrt{n})$ per query acceptable" -- you see the constraint math working out.
The aggregate operation is "too complex for segment tree" -- e.g., maintaining a sorted order per block, mode queries, or count-distinct queries.

Counter-examples (when sqrt decomp is NOT the right tool):

You need $O (\log n)$ per operation -- use a segment tree or BIT instead.
All queries are offline and you want $O ((n + q) \sqrt{n})$ with add/remove -- Mo's algorithm is the sqrt technique for that (see Mo's Algorithm).
$n \geq 10^{6}$ and tight time limit -- $O (\sqrt{n}) \approx 1000$ per query may be too slow; reach for $O (\log n)$ structures.

What the Code Won't Teach You

When sqrt beats everything else. Several competition problems have no known $O (n \log n)$ solution but have $O (n \sqrt{n})$ sqrt decomposition solutions. This happens with offline queries where the operation isn't mergeable, problems where the answer can't be maintained incrementally, or cases where the segment tree constant is too large. Sqrt is often the "fallback" that works when the "clean" data structure doesn't exist.

The block decomposition mindset. Sqrt isn't just a data structure -- it's a design pattern. You can apply it to decompose algorithms (Mo's algorithm), data (heavy-light decomposition), or complexity (sqrt trick for factorials). The unifying idea is: partition into $O (\sqrt{n})$ groups, precompute per-group summaries, handle cross-group queries by combining summaries. Recognizing this pattern in unfamiliar contexts is the advanced sqrt skill.

The "two levels of granularity" intuition. Every sqrt-decomposed structure maintains two views of the same data: the fine-grained element level and the coarse-grained block level. Updates touch one element and one block. Queries combine partial element-level scans with block-level summaries. Getting comfortable thinking at both levels simultaneously -- and knowing when to drop to element level vs. stay at block level -- is the core skill that the template code obscures.

C++ STL Reference

Function / Class	Header	What it does	Time Complexity
`vector<T>`	`<vector>`	Underlying array storage	$O (1)$ access
`fill(first, last, val)`	`<algorithm>`	Initialize block aggregates	$O (n)$
`accumulate(first, last, init)`	`<numeric>`	Sum a range (useful for building blocks)	$O (n)$
`min_element(first, last)`	`<algorithm>`	Find min in a partial block	$O (k)$
`max_element(first, last)`	`<algorithm>`	Find max in a partial block	$O (k)$
`sqrtl(n)` / `(int)sqrt((double)n)`	`<cmath>`	Compute block size	$O (1)$
`__builtin_popcount(x)`	built-in	Useful in bitwise sqrt-decomp variants	$O (1)$

Key detail: There is no STL "sqrt decomposition" container. You build it yourself with a vector for the array and a vector for block aggregates. That's the whole point -- it's simple enough to code from scratch in 2 minutes.

Implementation (Contest-Ready)

Same-Block Queries Need Special Handling

When query range [l, r] falls entirely within one block, the "process full blocks" loop runs zero times. You must handle this case separately:

Query [3, 6] with block_size = 4:
Block 0: [0,3], Block 1: [4,7], Block 2: [8,11]

l=3 is in block 0, r=6 is in block 1 -> different blocks -> normal path.

Query [5, 7] -- both in block 1:
  DON'T process block 1 as a full block (it's partial on both sides).
  Instead, iterate elements 5, 6, 7 directly.

cpp

int lb = l / B, rb = r / B;
if (lb == rb) {
    // Same block -- brute force elements [l, r]
    for (int i = l; i <= r; i++) ans += a[i];
} else {
    // Partial left block + full middle blocks + partial right block
    ...
}

Block rebuild on point update: After updating a[i], you must recompute block_sum[i / B] (or whatever aggregate your blocks store). Forgetting this means queries return stale block values.

  Point update with sqrt blocks:

  Update arr[6] = 10 (was 2):

  Before: [ 3  1  4  1 | 5  9  2  6 | 5  3  5  8 | 9  7  9  3 ]
  Sums:   [    9      |     22     |     21     |     28     ]
                             ^
                        block 1 contains index 6

  After:  [ 3  1  4  1 | 5  9 *10* 6 | 5  3  5  8 | 9  7  9  3 ]
  Sums:   [    9      |     30     |     21     |     28     ]
                            +8 (diff applied to block sum)

Version 1: Range Sum Queries + Point Update -- Minimal Template

cpp

#include <bits/stdc++.h>
using namespace std;

int main() {
    int n, q;
    cin >> n >> q;
    vector<long long> a(n);
    for (auto& x : a) cin >> x;

    int B = max(1, (int)sqrt(n));
    int nb = (n + B - 1) / B;
    vector<long long> bs(nb, 0);
    for (int i = 0; i < n; i++)
        bs[i / B] += a[i];

    // Point update: a[i] = val
    auto update = [&](int i, long long val) {
        bs[i / B] += val - a[i];
        a[i] = val;
    };

    // Range sum query: sum of a[l..r]
    auto query = [&](int l, int r) -> long long {
        long long res = 0;
        int bl = l / B, br = r / B;
        if (bl == br) {
            for (int i = l; i <= r; i++) res += a[i];
        } else {
            for (int i = l; i < (bl + 1) * B; i++) res += a[i];
            for (int b = bl + 1; b < br; b++) res += bs[b];
            for (int i = br * B; i <= r; i++) res += a[i];
        }
        return res;
    };

    while (q--) {
        int t; cin >> t;
        if (t == 1) {
            int i; long long v; cin >> i >> v;
            update(i, v);
        } else {
            int l, r; cin >> l >> r;
            cout << query(l, r) << "\n";
        }
    }
}

Version 2: Range Sum Queries + Point Update -- Explained

cpp

#include <bits/stdc++.h>
using namespace std;

int main() {
    int n, q;
    cin >> n >> q;
    vector<long long> a(n);
    for (auto& x : a) cin >> x;

    // Block size: sqrt(n), but at least 1 to avoid division by zero.
    int B = max(1, (int)sqrt(n));
    // Number of blocks: ceil(n / B).
    int nb = (n + B - 1) / B;

    // bs[b] = sum of elements in block b.
    vector<long long> bs(nb, 0);
    for (int i = 0; i < n; i++)
        bs[i / B] += a[i];

    // Point update: set a[i] = val.
    // Adjust the block sum by the difference (val - old value).
    // Cost: O(1).
    auto update = [&](int i, long long val) {
        bs[i / B] += val - a[i];
        a[i] = val;
    };

    // Range sum query: return a[l] + a[l+1] + ... + a[r].
    // Three cases:
    //   1) l and r are in the same block -> just scan.
    //   2) Otherwise:
    //      a) Scan partial left block: l to end of its block.
    //      b) Sum full middle blocks using bs[].
    //      c) Scan partial right block: start of its block to r.
    // Cost: O(sqrt(n)).
    auto query = [&](int l, int r) -> long long {
        long long res = 0;
        int bl = l / B, br = r / B;
        if (bl == br) {
            // Same block: just iterate.
            for (int i = l; i <= r; i++) res += a[i];
        } else {
            // Left partial block: l to end of block bl.
            for (int i = l; i < (bl + 1) * B; i++) res += a[i];
            // Full middle blocks.
            for (int b = bl + 1; b < br; b++) res += bs[b];
            // Right partial block: start of block br to r.
            for (int i = br * B; i <= r; i++) res += a[i];
        }
        return res;
    };

    while (q--) {
        int t; cin >> t;
        if (t == 1) {
            // Type 1: point update. Read index i and new value v.
            int i; long long v; cin >> i >> v;
            update(i, v);
        } else {
            // Type 2: range sum query. Read l and r (0-indexed, inclusive).
            int l, r; cin >> l >> r;
            cout << query(l, r) << "\n";
        }
    }
}

Operations Reference

Operation	Time	Space
Build (precompute block sums)	$O (n)$	$O (\sqrt{n})$ for block array
Point update `a[i] = val`	$O (1)$	$O (1)$
Range sum / min / max query $[l, r]$	$O (\sqrt{n})$	$O (1)$
Range update (add $v$ to $[l, r]$ ) with lazy	$O (\sqrt{n})$	$O (\sqrt{n})$ for lazy array
Range update + range query (lazy blocks)	$O (\sqrt{n})$ per op	$O (\sqrt{n})$
Total for $q$ queries on array of size $n$	$O ((n + q) \sqrt{n})$	$O (n)$

When is this fast enough? For $n, q \leq 10^{5}$ , cost is $\approx 10^{5} \cdot 316 \approx 3 \times 10^{7}$ . Comfortable. For $n, q = 5 \times 10^{5}$ , cost is $\approx 3.5 \times 10^{8}$ . Tight but often passes with fast I/O and simple operations.

Problem Patterns & Tricks

Pattern 1: Range Query + Point Update (Basic)

The bread-and-butter pattern. Maintain block aggregates (sum, min, max, gcd) and answer range queries in $O (\sqrt{n})$ . Point updates are $O (1)$ -- just update the element and the block aggregate.

Use this when the problem has $\leq 2 \times 10^{5}$ queries and you don't want to code a segment tree.

Examples: CF 13E -- Holes, CF 797F -- Mice and Holes

Pattern 2: Range Update + Range Query (Lazy Blocks)

Add a lazy[b] value per block. A range update "add $v$ to $[l, r]$ " applies $v$ to individual elements in partial blocks and adds $v$ to lazy[b] for full blocks. Queries add lazy[b] to each block sum.

cpp

// Range add v to [l, r]:
// partial blocks: a[i] += v, rebuild block sum
// full blocks: lazy[b] += v
// Query a[l..r]:
// partial blocks: a[i] + lazy[block(i)]
// full blocks: bs[b] + lazy[b] * B

Examples: CF 52C -- Circular RMQ, CF 242E -- XOR on Segment

Pattern 3: Mo's Algorithm (Offline Sqrt Decomposition on Queries)

Sort $q$ queries $[l, r]$ by $(l / B, r)$ -- block of $l$ first, then $r$ . Process them by adding/removing one element at a time. Total pointer movement is $O ((n + q) \sqrt{n})$ .

This is the most important application of sqrt decomposition in competitive programming. See Mo's Algorithm for the full treatment.

Examples: CF 86D -- Powerful Array, CF 617E -- XOR and Favorite Number

Pattern 4: Threshold / Heavy-Light by Frequency

Partition values into "heavy" (frequency $\geq \sqrt{n}$ ) and "light" (frequency $< \sqrt{n}$ ). There are at most $\sqrt{n}$ heavy values. Precompute answers for heavy values; handle light values with brute force.

This is a classic sqrt trick for count/frequency problems.

Examples: CF 375D -- Tree and Queries, CF 840D -- Destiny

Pattern 5: Sorted Blocks (Insert / Delete / k-th element)

Maintain each block as a sorted list. Insert/delete costs $O (\sqrt{n})$ per operation (rebuild the block). k-th element query: binary search across block sizes. Good for dynamic order statistics without a balanced BST.

cpp

// To find k-th element across sorted blocks:
for (int b = 0; b < nb; b++) {
    if (k <= (int)block[b].size()) {
        return block[b][k - 1]; // 1-indexed k
    }
    k -= block[b].size();
}

Examples: CF 455D -- Serega and Fun

Pattern 6: Batch Processing (Sqrt Buffering)

Accumulate updates in a buffer. Every $\sqrt{n}$ updates, flush the buffer and rebuild the structure. Queries check both the structure and the buffer.

This amortizes expensive rebuilds. Total cost for $q$ operations: $O (q \sqrt{n})$ .

Examples: CF 342E -- Xenia and Tree

Contest Cheat Sheet

+------------------------------------------------------------------+
|                  SQRT DECOMPOSITION CHEAT SHEET                  |
+------------------------------------------------------------------+
| WHEN TO USE:                                                     |
|   - Range queries + point/range updates, n <= 2e5               |
|   - Segment tree is overkill or hard to implement                |
|   - Mo's algorithm (offline queries)                             |
|   - Threshold trick: split into heavy/light by sqrt(n)           |
+------------------------------------------------------------------+
| TEMPLATE (range sum + point update):                             |
|                                                                  |
|   int B = max(1, (int)sqrt(n));                                  |
|   vector<long long> bs((n+B-1)/B, 0);                           |
|   for (int i=0;i<n;i++) bs[i/B] += a[i];  // build             |
|                                                                  |
|   // update a[i] = v:                                            |
|   bs[i/B] += v - a[i]; a[i] = v;                                |
|                                                                  |
|   // query sum [l,r]:                                            |
|   scan partial left + sum full blocks + scan partial right       |
+------------------------------------------------------------------+
| COMPLEXITY:  O(sqrt(n)) per query/update                         |
| SPACE:       O(n) array + O(sqrt(n)) block sums                 |
+------------------------------------------------------------------+
| REMEMBER:    Last block may be smaller than B                    |
|              Block of index i = i / B                            |
|              Block b covers indices [b*B, min((b+1)*B-1, n-1)]   |
+------------------------------------------------------------------+

Gotchas & Debugging

Block size choice

Always use int B = max(1, (int)sqrt(n)). The max(1, ...) prevents division by zero when $n = 0$ . Don't use ceil(sqrt(n)) -- it gives slightly worse constant in some cases, and floor is the standard choice.

For some problems, tuning $B$ to $\sqrt{n \cdot 2 / 3}$ or a fixed constant (300--500) can help pass tight time limits. Profile if needed.

Last incomplete block

The last block may contain fewer than $B$ elements. Your query loop must not read past index $n - 1$ . The safe pattern:

cpp

for (int i = br * B; i <= r; i++) ...  // r < n, so this is safe

But if you ever iterate over "all elements in block $b$ ", bound it:

cpp

int lo = b * B;
int hi = min((b + 1) * B - 1, n - 1);
for (int i = lo; i <= hi; i++) ...

Same-block edge case

When $l$ and $r$ are in the same block, do not run the three-part (left partial + middle + right partial) logic. Just iterate from $l$ to $r$ . Forgetting this causes double-counting.

Rebuilding block aggregate

After a point update, you must update the block aggregate. For sum, the $O (1)$ trick is bs[i/B] += new_val - old_val. For min/max, you may need to rescan the entire block in $O (B)$ . If the problem requires min/max, account for this cost.

Integer overflow

Block sums can be large. If each element is up to $10^{9}$ and block size is $\sqrt{n} \approx 450$ , a block sum can reach $4.5 \times 10^{11}$ . Use long long.

Off-by-one with 1-indexed input

Many CF problems use 1-indexed arrays. Convert to 0-indexed immediately, or adjust block formulas: block = (i - 1) / B for 1-indexed.

Mental Traps

Trap: "Segment tree is always better than sqrt decomposition."

Segment trees have $O (\log n)$ per operation vs. $O (\sqrt{n})$ , but sqrt decomposition has a much smaller constant factor. For $n = 10^{5}$ with tight TL, a well-implemented sqrt solution can be faster in practice. More importantly, sqrt decomposition handles operations that segment trees cannot -- e.g., arbitrary range reordering, operations without a good "merge" function.

Trap: "The block size is always $\sqrt{n}$ ."

The optimal $B$ minimizes $B + n / B$ , which gives $\sqrt{n}$ only when the cost-per-element equals the cost-per-block. If partial-block processing costs $O (B^{2})$ (e.g., re-sorting a block), the balance shifts: minimize $B^{2} + n / B$ -> $B = n^{1 / 3}$ . When in doubt, try a few values empirically.

Trap: "Sqrt decomposition is trivial to implement."

The idea is trivial. The boundary cases (same block, left partial, right partial), lazy updates for range modifications, and block rebuilds on updates are all fertile ground for off-by-one errors. Most WAs in sqrt decomposition come from boundary handling, not algorithmic misunderstanding.

Trap: "I should always use sqrt decomposition for range queries."

For static arrays, prefix sums give $O (1)$ queries. For $n > 10^{6}$ with tight time limits, $O (\sqrt{n}) \approx 1000$ per query may be too slow -- use $O (\log n)$ structures instead. Sqrt decomposition lives in the sweet spot: dynamic arrays with $n \leq 2 \times 10^{5}$ and operations that don't fit cleanly into a segment tree.

The mistake that teaches

RMQ with same-block queries. I implemented sqrt decomposition with $B = 317$ . It worked on small tests, then WA on test 7. When l and r are in the same block, my code processed them as "partial left block + full middle blocks + partial right block" -- but there were no full middle blocks, and the two partial pieces overlapped, double-counting elements. Fix: if l/B == r/B, brute-force from l to r directly. The "left tail + full blocks + right tail" decomposition assumes at least one full block in between.

CF 13E Holes (hidden chain inside rebuild). For each cell, precompute where you exit the current block and how many jumps it takes. My rebuild had a nested loop -- for each cell, I followed the jump chain to the block boundary, which could be $O (B)$ per cell, making rebuild $O (B^{2}) = O (n)$ . Total: $O (n \cdot Q)$ . TLE. Fix: process cells within the block in reverse order. Cell $i$ 's exit info depends on cell $i + jump [i]$ . If that's still in the same block, copy its precomputed answer; otherwise the answer is the direct jump. This makes rebuild truly $O (B)$ . Sqrt decomposition is $O (\sqrt{n})$ only if your block operations are genuinely $O (1)$ or $O (B)$ .

Debug Drills

Drill 1: Off-by-one in block boundary computation.

cpp

int B = (int)sqrt(n);
vector<long long> block_sum(n / B + 1, 0);
for (int i = 0; i < n; i++)
    block_sum[i / B] += a[i];

long long query(int l, int r) {
    long long res = 0;
    int bl = l / B, br = r / B;
    if (bl == br) {
        for (int i = l; i <= r; i++) res += a[i];
        return res;
    }
    for (int i = l; i < (bl + 1) * B; i++) res += a[i]; // left tail
    for (int b = bl + 1; b < br; b++) res += block_sum[b]; // full blocks
    for (int i = br * B; i <= r; i++) res += a[i]; // right tail
    return res;
}

What's wrong?

When n is not divisible by B, the expression n / B + 1 may allocate one too few blocks. For example, if n = 10 and B = 3, the blocks are {0,1,2}, {3,4,5}, {6,7,8}, {9} -- that's 4 blocks, but 10/3 + 1 = 4 happens to work. However, if n = 9 and B = 3, you get 9/3 + 1 = 4 blocks but only need 3. The real danger: if B = (int)sqrt(n) rounds down to 0 when n = 0, you get division by zero. Fix: use B = max(1, (int)sqrt(n)) and allocate (n + B - 1) / B blocks.

Drill 2: Forgetting to apply lazy tag before reading individual elements.

cpp

void range_add(int l, int r, int val) {
    int bl = l / B, br = r / B;
    if (bl == br) {
        for (int i = l; i <= r; i++) a[i] += val;
        block_sum[bl] += (long long)val * (r - l + 1);
        return;
    }
    for (int i = l; i < (bl + 1) * B; i++) a[i] += val;
    block_sum[bl] += (long long)val * ((bl + 1) * B - l);
    for (int b = bl + 1; b < br; b++) lazy[b] += val;
    for (int i = br * B; i <= r; i++) a[i] += val;
    block_sum[br] += (long long)val * (r - br * B + 1);
}

int point_query(int idx) {
    return a[idx]; // BUG
}

What's wrong?

point_query returns a[idx] without adding lazy[idx / B]. The lazy tag stores a pending addition that hasn't been applied to individual elements. Fix: return a[idx] + lazy[idx / B];. Alternatively, push the lazy tag to all elements in the block before accessing them.

Drill 3: Rebuilding sorted block incorrectly after update.

cpp

// Each block maintains a sorted copy for order-statistic queries
void update(int pos, int new_val) {
    int b = pos / B;
    a[pos] = new_val;
    // Rebuild sorted block
    sorted_block[b].clear();
    for (int i = b * B; i < (b + 1) * B; i++)
        sorted_block[b].push_back(a[i]);
    sort(sorted_block[b].begin(), sorted_block[b].end());
}

What's wrong?

The rebuild loop iterates from b*B to (b+1)*B, but for the last block, (b+1)*B may exceed n. This reads uninitialized or out-of-bounds memory. Fix: use min((b+1)*B, n) as the upper bound: for (int i = b*B; i < min((b+1)*B, n); i++).

Drill 4: Inclusive vs exclusive right boundary in partial block.

cpp

int B = sqrt(n);
for (int i = 0; i < n; i++)
    block_id[i] = i / B;

// Query [l, r]:
if (block_id[l] == block_id[r]) {
    // brute force
} else {
    // partial left block: l to end of block
    for (int i = l; i < (block_id[l] + 1) * B; i++) // ...
    // full blocks
    for (int b = block_id[l] + 1; b < block_id[r]; b++) // ...
    // partial right block: start of block to r
    for (int i = block_id[r] * B; i < r; i++) // ...  BUG HERE
}

What's wrong?

The right partial block loop uses i < r but should use i <= r. The range is [l, r] inclusive, so position r itself must be included. This off-by-one means the rightmost element of every query is silently excluded -- producing answers that are slightly wrong.

Drill 5: Block size when n is small.

cpp

int B = sqrt(n);
int num_blocks = (n + B - 1) / B;
vector<long long> block_sum(num_blocks, 0);
for (int i = 0; i < n; i++)
    block_sum[i / B] += a[i];

What's wrong?

When n = 0, B = sqrt(0) = 0, and i / B causes division by zero. Even for small n like 1-3, sqrt(n) might return 1, which works but leads to degenerate behavior. Always guard: int B = max(1, (int)sqrt(n));. Also, for n = 0, the loop doesn't execute, but num_blocks computes (0 + 0 - 1) / 0 -- another division by zero.

Drill 6: Lazy propagation push-down on last block.

cpp

void push(int b) {
    if (lazy[b] != 0) {
        for (int i = b * B; i < (b + 1) * B; i++)
            a[i] += lazy[b];
        lazy[b] = 0;
    }
}
// Problem: the last block may have fewer than B elements

What's wrong?

The loop i < (b + 1) * B goes out of bounds for the last block when n is not a multiple of B. The last block has only n - b * B elements. Fix:

cpp

for (int i = b * B; i < min((b + 1) * B, n); i++)

Without this, you write past the array end, causing undefined behavior -- which might silently corrupt other data rather than crashing.

Debug checklist

Print block sums after building -- verify they match naive sums.
Test query where $l$ and $r$ are in the same block.
Test query where $l = 0$ and $r = n - 1$ (full range).
Test with $n = 1$ (single element, single block).
After point update, verify the block sum changed correctly.

Self-Test

Without opening the implementation above, verify you can answer each item:

[ ] Write the 3-part range query (left partial + full blocks + right partial) from memory, including the same-block special case.
[ ] Given a point update a[i] = v, explain how and why you update the block sum in $O (1)$ for sum aggregates vs. $O (B)$ for min/max aggregates.
[ ] Derive the optimal block size $B$ when partial-block processing costs $O (B^{2})$ instead of $O (B)$ . (Hint: minimize $B^{2} + n / B$ .)
[ ] Name a problem category where sqrt decomposition works but segment trees struggle (non-mergeable operations, arbitrary reordering, etc.).
[ ] Explain the "same block" edge case: why does it need separate handling, and what goes wrong if you treat $l$ and $r$ as belonging to two separate partial blocks?
[ ] Describe how lazy tags work for range updates on full blocks, and why applying element-by-element to full blocks ruins the complexity.
[ ] Trace through a range query $[2, 13]$ on a 16-element array with $B = 4$ , identifying which parts are partial and which are full blocks.

Practice Problems

Rating Progression

CF Rating	What to Master
1200	Basic block decomposition: precompute block sums, answer range sum queries by combining full blocks + partial tails. Understand blocks: array of size n split into chunks of size ~√n.
1500	Lazy block updates: maintain a per-block "add" tag and apply it during queries on partial blocks. Implement range sum with point updates using sqrt blocks; handle partial blocks at boundaries.
1800	Sorted blocks for order-statistic queries; rebuilding a single block in O(√n) after an update; threshold tricks (brute-force small, DS for large). Lazy propagation within blocks, the threshold trick (values appearing > √n times), Mo's algorithm.
2100	Sqrt on queries (Mo's algorithm foundations); batched sqrt-decomposition on time for offline updates; combining sqrt blocks with other DS inside blocks. Sqrt of sqrt (block-of-blocks), sqrt decomposition on queries.

#	Problem	Source	Difficulty	Key Idea
1	Static Range Sum Queries	CSES	1100	Warmup -- prefix sums suffice, but practice block decomp as exercise
2	Dynamic Range Sum Queries	CSES	1400	Range sum + point update -- direct sqrt decomp application
3	Circular RMQ	CF 52C	1600	Range add + range min on circular array, lazy blocks
4	XOR on Segment	CF 242E	1700	Range XOR update + range sum query with sqrt or segment tree
5	Powerful Array	CF 86D	1800	Mo's algorithm -- the classic sqrt offline problem
6	Tree and Queries	CF 375D	1800	Mo's on trees with Euler tour + threshold trick
7	XOR and Favorite Number	CF 617E	1900	Mo's algorithm with XOR prefix sums
8	Holes	CF 13E	2000	Sqrt decomp with jump pointers per block
9	Serega and Fun	CF 455D	2100	Sqrt decomp with cyclic shifts inside blocks
10	Destiny	CF 840D	2100	Sqrt threshold trick -- heavy/light value decomposition

Sqrt Decomposition ​

Contents ​

Intuition ​

2.1 The Pain ​

2.2 The Key Insight ​

2.3 Visual Walkthrough ​

Step 1 -- Build block sums (O(n)) ​

Step 2 -- Range query: sum(l=2, r=13) ​

Step 3 -- Point update: set a[5]=1 ​

Step 4 -- Query after update: sum(l=4, r=7) ​

2.4 The Invariant ​

2.5 When to Reach for This ​

What the Code Won't Teach You ​

C++ STL Reference ​

Implementation (Contest-Ready) ​

Same-Block Queries Need Special Handling ​

Version 1: Range Sum Queries + Point Update -- Minimal Template ​

Version 2: Range Sum Queries + Point Update -- Explained ​

Operations Reference ​

Problem Patterns & Tricks ​

Pattern 1: Range Query + Point Update (Basic) ​

Pattern 2: Range Update + Range Query (Lazy Blocks) ​

Pattern 3: Mo's Algorithm (Offline Sqrt Decomposition on Queries) ​

Pattern 4: Threshold / Heavy-Light by Frequency ​

Pattern 5: Sorted Blocks (Insert / Delete / k-th element) ​

Pattern 6: Batch Processing (Sqrt Buffering) ​

Contest Cheat Sheet ​

Gotchas & Debugging ​

Block size choice ​

Last incomplete block ​

Same-block edge case ​

Rebuilding block aggregate ​

Integer overflow ​

Off-by-one with 1-indexed input ​

Mental Traps ​

The mistake that teaches ​

Debug Drills ​

Debug checklist ​

Self-Test ​

Practice Problems ​

Further Reading ​

Sqrt Decomposition

Contents

Intuition

2.1 The Pain

2.2 The Key Insight

2.3 Visual Walkthrough

Step 1 -- Build block sums ( $O (n)$ )

Step 2 -- Range query: sum( $l = 2$ , $r = 13$ )

Step 3 -- Point update: set $a [5] = 1$

Step 4 -- Query after update: sum( $l = 4$ , $r = 7$ )

2.4 The Invariant

2.5 When to Reach for This

What the Code Won't Teach You

C++ STL Reference

Implementation (Contest-Ready)

Same-Block Queries Need Special Handling

Version 1: Range Sum Queries + Point Update -- Minimal Template

Version 2: Range Sum Queries + Point Update -- Explained

Operations Reference

Problem Patterns & Tricks

Pattern 1: Range Query + Point Update (Basic)

Pattern 2: Range Update + Range Query (Lazy Blocks)

Pattern 3: Mo's Algorithm (Offline Sqrt Decomposition on Queries)

Pattern 4: Threshold / Heavy-Light by Frequency

Pattern 5: Sorted Blocks (Insert / Delete / k-th element)

Pattern 6: Batch Processing (Sqrt Buffering)

Contest Cheat Sheet

Gotchas & Debugging

Block size choice

Last incomplete block

Same-block edge case

Rebuilding block aggregate

Integer overflow

Off-by-one with 1-indexed input

Mental Traps

The mistake that teaches

Debug Drills

Debug checklist

Self-Test

Practice Problems

Further Reading