Skip to content

Sqrt Decomposition

Quick Revisit

  • USE WHEN: range queries + point updates and segment tree feels like overkill or is hard to apply
  • INVARIANT: array split into √n blocks; full blocks use precomputed answers, partial blocks are brute-forced
  • TIME: O(√n) per query and per update | SPACE: O(n)
  • CLASSIC BUG: last block is smaller than √n — forgetting to handle the partial tail block
  • PRACTICE: 08-ladder-mixed

AL-39 | Split an array into O(n) blocks to answer range queries and point updates in O(n) per operation -- the brute-force data structure that's often good enough.

Why this lives in the offline-techniques chapter. Sqrt decomposition itself is not an offline method -- it supports online point updates and range queries, and could equally well sit in Chapter 2. It is grouped here because (a) it is the canonical block-based fallback when segment trees feel like overkill, and (b) the block-decomposition mindset is the prerequisite for Mo's algorithm. Treat the chapter placement as a roadmap waypoint, not a claim that sqrt decomposition is fundamentally offline.

Difficulty: [Intermediate]Prerequisites: Arrays and Strings, Prefix SumsCF Rating Range: 1600 -- 2000


Contents


Intuition

2.1 The Pain

Consider an array of 16 elements. You need to support two operations: range sum queries and point updates.

  Index:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
         +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  a[]:   | 3 | 1 | 4 | 1 | 5 | 9 | 2 | 6 | 5 | 3 | 5 | 8 | 9 | 7 | 9 | 3 |
         +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

The naive approach scans every element in the range: O(n) per query. With n=105 elements and q=105 queries, that is 1010 operations -- far too slow for a 2-second time limit. Prefix sums handle static queries in O(1), but a single point update forces an O(n) rebuild. We need something in between.

2.2 The Key Insight

Split the array into blocks of size Bn -- each query touches at most 2 partial blocks (O(n) each) and O(n) full blocks with precomputed answers, giving O(n) per query.

B is a performance parameter, not a sacred constant. The default B=n minimizes B+n/B, which is the right balance when partial-block scans and full-block lookups have equal per-element cost. When the costs are asymmetric -- e.g., far more queries than updates, or partial-block work is O(B2) instead of O(B) -- the optimal B shifts. It is worth tuning when you are close to the time limit; see the "block size choice" gotcha below.

Analogy. Imagine organizing books on a bookshelf. You group every n books into one shelf and label each shelf with the total number of books it holds. To count books in a range, you read the labels on all full shelves instantly, then count individual books on the two partial shelves at the ends by hand. The total work is proportional to n, not n.

  Sqrt block structure (n=16, block_size=4):

  Array:  [ 3  1  4  1 | 5  9  2  6 | 5  3  5  8 | 9  7  9  3 ]
  Blocks:  +-block 0--+  +-block 1--+  +-block 2--+  +-block 3--+
  Sums:   [    9      |     22     |     21     |     28     ]

  Query [2..13]:
  partial |##|  full block  |  full block  | partial |##|
  [ 3  1 |4  1 | 5  9  2  6 | 5  3  5  8 | 9  7| 9  3 ]
         ^                                      ^
         L=2                                    R=13

  = brute(2..3) + block_sum[1] + block_sum[2] + brute(12..13)

For our 16-element array, B=16=4, giving 4 blocks:

  Index:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
         +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  a[]:   | 3 | 1 | 4 | 1 | 5 | 9 | 2 | 6 | 5 | 3 | 5 | 8 | 9 | 7 | 9 | 3 |
         +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
         |   block 0     |   block 1     |   block 2     |   block 3     |
         |   sum = 9     |   sum = 22    |   sum = 21    |   sum = 28    |
         +---------------+---------------+---------------+---------------+

2.3 Visual Walkthrough

Step 1 -- Build block sums (O(n))

Scan the array once, accumulating each element into its block:

  block[0] = a[0]+a[1]+a[2]+a[3]   = 3+1+4+1   =  9
  block[1] = a[4]+a[5]+a[6]+a[7]   = 5+9+2+6   = 22
  block[2] = a[8]+a[9]+a[10]+a[11] = 5+3+5+8   = 21
  block[3] = a[12]+a[13]+a[14]+a[15] = 9+7+9+3 = 28

Step 2 -- Range query: sum(l=2, r=13)

  Index:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
         +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  a[]:   | 3 | 1 |*4 |*1 |*5 |*9 |*2 |*6 |*5 |*3 |*5 |*8 |*9 |*7 | 9 | 3 |
         +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
         | block 0       | block 1       | block 2       | block 3       |
                  ^                                              ^
                  l=2                                            r=13

  (a) Partial left block (block 0, indices 2..3):
      Scan a[2] + a[3] = 4 + 1 = 5              cost: 2 elements

  (b) Full middle blocks (blocks 1 and 2):
      block[1] + block[2] = 22 + 21 = 43        cost: 2 block lookups

  (c) Partial right block (block 3, indices 12..13):
      Scan a[12] + a[13] = 9 + 7 = 16           cost: 2 elements

  Total = 5 + 43 + 16 = 64
  Verify: 4+1+5+9+2+6+5+3+5+8+9+7 = 64  <-- correct

Worst-case cost: O(B)+O(n/B)+O(B). With B=n this is O(n).

  Query complexity analysis:

  +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  |  |##|  full blocks (at most sqrt(n))  |##|  |  |
  +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
       ^                                    ^
    partial                              partial
    (<= B elts)                         (<= B elts)

  Total = O(B) + O(n/B) + O(B) = O(B + n/B)
  Minimized when B = sqrt(n):  O(sqrt(n))

Step 3 -- Point update: set a[5]=1

  Old a[5] = 9.  New a[5] = 1.  Difference = 1 - 9 = -8.
  block_id = 5 / 4 = 1.
  block[1] += -8  =>  22 - 8 = 14.
  a[5] = 1.

  After update:
         +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  a[]:   | 3 | 1 | 4 | 1 | 5 | 1 | 2 | 6 | 5 | 3 | 5 | 8 | 9 | 7 | 9 | 3 |
         +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
         | block 0       | block 1       | block 2       | block 3       |
         | sum = 9       | sum = 14      | sum = 21      | sum = 28      |
                                ^
                              updated

  Cost: O(1)

Step 4 -- Query after update: sum(l=4, r=7)

  l and r are both in block 1.  Same-block case: just scan.

  a[4] + a[5] + a[6] + a[7] = 5 + 1 + 2 + 6 = 14
  (Matches the updated block[1] = 14.)

  Cost: O(B) = O(sqrt(n))

2.4 The Invariant

  +--------------------------------------------------------------+
  | INVARIANT: block[b] always stores the aggregate (sum / min / |
  | max) of the elements a[b*B .. min((b+1)*B-1, n-1)].         |
  |                                                              |
  | - Point update maintains this in O(1) for sum (O(B) for     |
  |   min/max, since you must rescan the block).                 |
  | - Range query decomposes [l,r] into at most O(sqrt(n))       |
  |   blocks plus two partial tails of O(sqrt(n)) elements each. |
  +--------------------------------------------------------------+

Every operation preserves the invariant. Build sets it. Point update restores it by adjusting a single block. Queries rely on it to skip full blocks.

2.5 When to Reach for This

Trigger phrases in problem statements:

  • "range queries + updates" with n,q2×105 -- sqrt decomp is the simplest solution.
  • "O(n) per query acceptable" -- you see the constraint math working out.
  • The aggregate operation is "too complex for segment tree" -- e.g., maintaining a sorted order per block, mode queries, or count-distinct queries.

Counter-examples (when sqrt decomp is NOT the right tool):

  • You need O(logn) per operation -- use a segment tree or BIT instead.
  • All queries are offline and you want O((n+q)n) with add/remove -- Mo's algorithm is the sqrt technique for that (see Mo's Algorithm).
  • n106 and tight time limit -- O(n)1000 per query may be too slow; reach for O(logn) structures.

What the Code Won't Teach You

When sqrt beats everything else. Several competition problems have no known O(nlogn) solution but have O(nn) sqrt decomposition solutions. This happens with offline queries where the operation isn't mergeable, problems where the answer can't be maintained incrementally, or cases where the segment tree constant is too large. Sqrt is often the "fallback" that works when the "clean" data structure doesn't exist.

The block decomposition mindset. Sqrt isn't just a data structure -- it's a design pattern. You can apply it to decompose algorithms (Mo's algorithm), data (heavy-light decomposition), or complexity (sqrt trick for factorials). The unifying idea is: partition into O(n) groups, precompute per-group summaries, handle cross-group queries by combining summaries. Recognizing this pattern in unfamiliar contexts is the advanced sqrt skill.

The "two levels of granularity" intuition. Every sqrt-decomposed structure maintains two views of the same data: the fine-grained element level and the coarse-grained block level. Updates touch one element and one block. Queries combine partial element-level scans with block-level summaries. Getting comfortable thinking at both levels simultaneously -- and knowing when to drop to element level vs. stay at block level -- is the core skill that the template code obscures.


C++ STL Reference

Function / ClassHeaderWhat it doesTime Complexity
vector<T><vector>Underlying array storageO(1) access
fill(first, last, val)<algorithm>Initialize block aggregatesO(n)
accumulate(first, last, init)<numeric>Sum a range (useful for building blocks)O(n)
min_element(first, last)<algorithm>Find min in a partial blockO(k)
max_element(first, last)<algorithm>Find max in a partial blockO(k)
sqrtl(n) / (int)sqrt((double)n)<cmath>Compute block sizeO(1)
__builtin_popcount(x)built-inUseful in bitwise sqrt-decomp variantsO(1)

Key detail: There is no STL "sqrt decomposition" container. You build it yourself with a vector for the array and a vector for block aggregates. That's the whole point -- it's simple enough to code from scratch in 2 minutes.


Implementation (Contest-Ready)

Same-Block Queries Need Special Handling

When query range [l, r] falls entirely within one block, the "process full blocks" loop runs zero times. You must handle this case separately:

Query [3, 6] with block_size = 4:
Block 0: [0,3], Block 1: [4,7], Block 2: [8,11]

l=3 is in block 0, r=6 is in block 1 -> different blocks -> normal path.

Query [5, 7] -- both in block 1:
  DON'T process block 1 as a full block (it's partial on both sides).
  Instead, iterate elements 5, 6, 7 directly.
cpp
int lb = l / B, rb = r / B;
if (lb == rb) {
    // Same block -- brute force elements [l, r]
    for (int i = l; i <= r; i++) ans += a[i];
} else {
    // Partial left block + full middle blocks + partial right block
    ...
}

Block rebuild on point update: After updating a[i], you must recompute block_sum[i / B] (or whatever aggregate your blocks store). Forgetting this means queries return stale block values.

  Point update with sqrt blocks:

  Update arr[6] = 10 (was 2):

  Before: [ 3  1  4  1 | 5  9  2  6 | 5  3  5  8 | 9  7  9  3 ]
  Sums:   [    9      |     22     |     21     |     28     ]
                             ^
                        block 1 contains index 6

  After:  [ 3  1  4  1 | 5  9 *10* 6 | 5  3  5  8 | 9  7  9  3 ]
  Sums:   [    9      |     30     |     21     |     28     ]
                            +8 (diff applied to block sum)

Version 1: Range Sum Queries + Point Update -- Minimal Template

cpp
#include <bits/stdc++.h>
using namespace std;

int main() {
    int n, q;
    cin >> n >> q;
    vector<long long> a(n);
    for (auto& x : a) cin >> x;

    int B = max(1, (int)sqrt(n));
    int nb = (n + B - 1) / B;
    vector<long long> bs(nb, 0);
    for (int i = 0; i < n; i++)
        bs[i / B] += a[i];

    // Point update: a[i] = val
    auto update = [&](int i, long long val) {
        bs[i / B] += val - a[i];
        a[i] = val;
    };

    // Range sum query: sum of a[l..r]
    auto query = [&](int l, int r) -> long long {
        long long res = 0;
        int bl = l / B, br = r / B;
        if (bl == br) {
            for (int i = l; i <= r; i++) res += a[i];
        } else {
            for (int i = l; i < (bl + 1) * B; i++) res += a[i];
            for (int b = bl + 1; b < br; b++) res += bs[b];
            for (int i = br * B; i <= r; i++) res += a[i];
        }
        return res;
    };

    while (q--) {
        int t; cin >> t;
        if (t == 1) {
            int i; long long v; cin >> i >> v;
            update(i, v);
        } else {
            int l, r; cin >> l >> r;
            cout << query(l, r) << "\n";
        }
    }
}

Version 2: Range Sum Queries + Point Update -- Explained

cpp
#include <bits/stdc++.h>
using namespace std;

int main() {
    int n, q;
    cin >> n >> q;
    vector<long long> a(n);
    for (auto& x : a) cin >> x;

    // Block size: sqrt(n), but at least 1 to avoid division by zero.
    int B = max(1, (int)sqrt(n));
    // Number of blocks: ceil(n / B).
    int nb = (n + B - 1) / B;

    // bs[b] = sum of elements in block b.
    vector<long long> bs(nb, 0);
    for (int i = 0; i < n; i++)
        bs[i / B] += a[i];

    // Point update: set a[i] = val.
    // Adjust the block sum by the difference (val - old value).
    // Cost: O(1).
    auto update = [&](int i, long long val) {
        bs[i / B] += val - a[i];
        a[i] = val;
    };

    // Range sum query: return a[l] + a[l+1] + ... + a[r].
    // Three cases:
    //   1) l and r are in the same block -> just scan.
    //   2) Otherwise:
    //      a) Scan partial left block: l to end of its block.
    //      b) Sum full middle blocks using bs[].
    //      c) Scan partial right block: start of its block to r.
    // Cost: O(sqrt(n)).
    auto query = [&](int l, int r) -> long long {
        long long res = 0;
        int bl = l / B, br = r / B;
        if (bl == br) {
            // Same block: just iterate.
            for (int i = l; i <= r; i++) res += a[i];
        } else {
            // Left partial block: l to end of block bl.
            for (int i = l; i < (bl + 1) * B; i++) res += a[i];
            // Full middle blocks.
            for (int b = bl + 1; b < br; b++) res += bs[b];
            // Right partial block: start of block br to r.
            for (int i = br * B; i <= r; i++) res += a[i];
        }
        return res;
    };

    while (q--) {
        int t; cin >> t;
        if (t == 1) {
            // Type 1: point update. Read index i and new value v.
            int i; long long v; cin >> i >> v;
            update(i, v);
        } else {
            // Type 2: range sum query. Read l and r (0-indexed, inclusive).
            int l, r; cin >> l >> r;
            cout << query(l, r) << "\n";
        }
    }
}

Operations Reference

OperationTimeSpace
Build (precompute block sums)O(n)O(n) for block array
Point update a[i] = valO(1)O(1)
Range sum / min / max query [l,r]O(n)O(1)
Range update (add v to [l,r]) with lazyO(n)O(n) for lazy array
Range update + range query (lazy blocks)O(n) per opO(n)
Total for q queries on array of size nO((n+q)n)O(n)

When is this fast enough? For n,q105, cost is 1053163×107. Comfortable. For n,q=5×105, cost is 3.5×108. Tight but often passes with fast I/O and simple operations.


Problem Patterns & Tricks

Pattern 1: Range Query + Point Update (Basic)

The bread-and-butter pattern. Maintain block aggregates (sum, min, max, gcd) and answer range queries in O(n). Point updates are O(1) -- just update the element and the block aggregate.

Use this when the problem has 2×105 queries and you don't want to code a segment tree.

Examples: CF 13E -- Holes, CF 797F -- Mice and Holes


Pattern 2: Range Update + Range Query (Lazy Blocks)

Add a lazy[b] value per block. A range update "add v to [l,r]" applies v to individual elements in partial blocks and adds v to lazy[b] for full blocks. Queries add lazy[b] to each block sum.

cpp
// Range add v to [l, r]:
// partial blocks: a[i] += v, rebuild block sum
// full blocks: lazy[b] += v
// Query a[l..r]:
// partial blocks: a[i] + lazy[block(i)]
// full blocks: bs[b] + lazy[b] * B

Examples: CF 52C -- Circular RMQ, CF 242E -- XOR on Segment


Pattern 3: Mo's Algorithm (Offline Sqrt Decomposition on Queries)

Sort q queries [l,r] by (l/B,r) -- block of l first, then r. Process them by adding/removing one element at a time. Total pointer movement is O((n+q)n).

This is the most important application of sqrt decomposition in competitive programming. See Mo's Algorithm for the full treatment.

Examples: CF 86D -- Powerful Array, CF 617E -- XOR and Favorite Number


Pattern 4: Threshold / Heavy-Light by Frequency

Partition values into "heavy" (frequency n) and "light" (frequency <n). There are at most n heavy values. Precompute answers for heavy values; handle light values with brute force.

This is a classic sqrt trick for count/frequency problems.

Examples: CF 375D -- Tree and Queries, CF 840D -- Destiny


Pattern 5: Sorted Blocks (Insert / Delete / k-th element)

Maintain each block as a sorted list. Insert/delete costs O(n) per operation (rebuild the block). k-th element query: binary search across block sizes. Good for dynamic order statistics without a balanced BST.

cpp
// To find k-th element across sorted blocks:
for (int b = 0; b < nb; b++) {
    if (k <= (int)block[b].size()) {
        return block[b][k - 1]; // 1-indexed k
    }
    k -= block[b].size();
}

Examples: CF 455D -- Serega and Fun


Pattern 6: Batch Processing (Sqrt Buffering)

Accumulate updates in a buffer. Every n updates, flush the buffer and rebuild the structure. Queries check both the structure and the buffer.

This amortizes expensive rebuilds. Total cost for q operations: O(qn).

Examples: CF 342E -- Xenia and Tree


Contest Cheat Sheet

+------------------------------------------------------------------+
|                  SQRT DECOMPOSITION CHEAT SHEET                  |
+------------------------------------------------------------------+
| WHEN TO USE:                                                     |
|   - Range queries + point/range updates, n <= 2e5               |
|   - Segment tree is overkill or hard to implement                |
|   - Mo's algorithm (offline queries)                             |
|   - Threshold trick: split into heavy/light by sqrt(n)           |
+------------------------------------------------------------------+
| TEMPLATE (range sum + point update):                             |
|                                                                  |
|   int B = max(1, (int)sqrt(n));                                  |
|   vector<long long> bs((n+B-1)/B, 0);                           |
|   for (int i=0;i<n;i++) bs[i/B] += a[i];  // build             |
|                                                                  |
|   // update a[i] = v:                                            |
|   bs[i/B] += v - a[i]; a[i] = v;                                |
|                                                                  |
|   // query sum [l,r]:                                            |
|   scan partial left + sum full blocks + scan partial right       |
+------------------------------------------------------------------+
| COMPLEXITY:  O(sqrt(n)) per query/update                         |
| SPACE:       O(n) array + O(sqrt(n)) block sums                 |
+------------------------------------------------------------------+
| REMEMBER:    Last block may be smaller than B                    |
|              Block of index i = i / B                            |
|              Block b covers indices [b*B, min((b+1)*B-1, n-1)]   |
+------------------------------------------------------------------+

Gotchas & Debugging

Block size choice

Always use int B = max(1, (int)sqrt(n)). The max(1, ...) prevents division by zero when n=0. Don't use ceil(sqrt(n)) -- it gives slightly worse constant in some cases, and floor is the standard choice.

For some problems, tuning B to n2/3 or a fixed constant (300--500) can help pass tight time limits. Profile if needed.

Last incomplete block

The last block may contain fewer than B elements. Your query loop must not read past index n1. The safe pattern:

cpp
for (int i = br * B; i <= r; i++) ...  // r < n, so this is safe

But if you ever iterate over "all elements in block b", bound it:

cpp
int lo = b * B;
int hi = min((b + 1) * B - 1, n - 1);
for (int i = lo; i <= hi; i++) ...

Same-block edge case

When l and r are in the same block, do not run the three-part (left partial + middle + right partial) logic. Just iterate from l to r. Forgetting this causes double-counting.

Rebuilding block aggregate

After a point update, you must update the block aggregate. For sum, the O(1) trick is bs[i/B] += new_val - old_val. For min/max, you may need to rescan the entire block in O(B). If the problem requires min/max, account for this cost.

Integer overflow

Block sums can be large. If each element is up to 109 and block size is n450, a block sum can reach 4.5×1011. Use long long.

Off-by-one with 1-indexed input

Many CF problems use 1-indexed arrays. Convert to 0-indexed immediately, or adjust block formulas: block = (i - 1) / B for 1-indexed.

Mental Traps

Trap: "Segment tree is always better than sqrt decomposition."

Segment trees have O(logn) per operation vs. O(n), but sqrt decomposition has a much smaller constant factor. For n=105 with tight TL, a well-implemented sqrt solution can be faster in practice. More importantly, sqrt decomposition handles operations that segment trees cannot -- e.g., arbitrary range reordering, operations without a good "merge" function.

Trap: "The block size is always n."

The optimal B minimizes B+n/B, which gives n only when the cost-per-element equals the cost-per-block. If partial-block processing costs O(B2) (e.g., re-sorting a block), the balance shifts: minimize B2+n/B -> B=n1/3. When in doubt, try a few values empirically.

Trap: "Sqrt decomposition is trivial to implement."

The idea is trivial. The boundary cases (same block, left partial, right partial), lazy updates for range modifications, and block rebuilds on updates are all fertile ground for off-by-one errors. Most WAs in sqrt decomposition come from boundary handling, not algorithmic misunderstanding.

Trap: "I should always use sqrt decomposition for range queries."

For static arrays, prefix sums give O(1) queries. For n>106 with tight time limits, O(n)1000 per query may be too slow -- use O(logn) structures instead. Sqrt decomposition lives in the sweet spot: dynamic arrays with n2×105 and operations that don't fit cleanly into a segment tree.

The mistake that teaches

RMQ with same-block queries. I implemented sqrt decomposition with B=317. It worked on small tests, then WA on test 7. When l and r are in the same block, my code processed them as "partial left block + full middle blocks + partial right block" -- but there were no full middle blocks, and the two partial pieces overlapped, double-counting elements. Fix: if l/B == r/B, brute-force from l to r directly. The "left tail + full blocks + right tail" decomposition assumes at least one full block in between.

CF 13E Holes (hidden chain inside rebuild). For each cell, precompute where you exit the current block and how many jumps it takes. My rebuild had a nested loop -- for each cell, I followed the jump chain to the block boundary, which could be O(B) per cell, making rebuild O(B2)=O(n). Total: O(nQ). TLE. Fix: process cells within the block in reverse order. Cell i's exit info depends on cell i+jump[i]. If that's still in the same block, copy its precomputed answer; otherwise the answer is the direct jump. This makes rebuild truly O(B). Sqrt decomposition is O(n) only if your block operations are genuinely O(1) or O(B).

Debug Drills

Drill 1: Off-by-one in block boundary computation.

cpp
int B = (int)sqrt(n);
vector<long long> block_sum(n / B + 1, 0);
for (int i = 0; i < n; i++)
    block_sum[i / B] += a[i];

long long query(int l, int r) {
    long long res = 0;
    int bl = l / B, br = r / B;
    if (bl == br) {
        for (int i = l; i <= r; i++) res += a[i];
        return res;
    }
    for (int i = l; i < (bl + 1) * B; i++) res += a[i]; // left tail
    for (int b = bl + 1; b < br; b++) res += block_sum[b]; // full blocks
    for (int i = br * B; i <= r; i++) res += a[i]; // right tail
    return res;
}
What's wrong?

When n is not divisible by B, the expression n / B + 1 may allocate one too few blocks. For example, if n = 10 and B = 3, the blocks are {0,1,2}, {3,4,5}, {6,7,8}, {9} -- that's 4 blocks, but 10/3 + 1 = 4 happens to work. However, if n = 9 and B = 3, you get 9/3 + 1 = 4 blocks but only need 3. The real danger: if B = (int)sqrt(n) rounds down to 0 when n = 0, you get division by zero. Fix: use B = max(1, (int)sqrt(n)) and allocate (n + B - 1) / B blocks.

Drill 2: Forgetting to apply lazy tag before reading individual elements.

cpp
void range_add(int l, int r, int val) {
    int bl = l / B, br = r / B;
    if (bl == br) {
        for (int i = l; i <= r; i++) a[i] += val;
        block_sum[bl] += (long long)val * (r - l + 1);
        return;
    }
    for (int i = l; i < (bl + 1) * B; i++) a[i] += val;
    block_sum[bl] += (long long)val * ((bl + 1) * B - l);
    for (int b = bl + 1; b < br; b++) lazy[b] += val;
    for (int i = br * B; i <= r; i++) a[i] += val;
    block_sum[br] += (long long)val * (r - br * B + 1);
}

int point_query(int idx) {
    return a[idx]; // BUG
}
What's wrong?

point_query returns a[idx] without adding lazy[idx / B]. The lazy tag stores a pending addition that hasn't been applied to individual elements. Fix: return a[idx] + lazy[idx / B];. Alternatively, push the lazy tag to all elements in the block before accessing them.

Drill 3: Rebuilding sorted block incorrectly after update.

cpp
// Each block maintains a sorted copy for order-statistic queries
void update(int pos, int new_val) {
    int b = pos / B;
    a[pos] = new_val;
    // Rebuild sorted block
    sorted_block[b].clear();
    for (int i = b * B; i < (b + 1) * B; i++)
        sorted_block[b].push_back(a[i]);
    sort(sorted_block[b].begin(), sorted_block[b].end());
}
What's wrong?

The rebuild loop iterates from b*B to (b+1)*B, but for the last block, (b+1)*B may exceed n. This reads uninitialized or out-of-bounds memory. Fix: use min((b+1)*B, n) as the upper bound: for (int i = b*B; i < min((b+1)*B, n); i++).

Drill 4: Inclusive vs exclusive right boundary in partial block.

cpp
int B = sqrt(n);
for (int i = 0; i < n; i++)
    block_id[i] = i / B;

// Query [l, r]:
if (block_id[l] == block_id[r]) {
    // brute force
} else {
    // partial left block: l to end of block
    for (int i = l; i < (block_id[l] + 1) * B; i++) // ...
    // full blocks
    for (int b = block_id[l] + 1; b < block_id[r]; b++) // ...
    // partial right block: start of block to r
    for (int i = block_id[r] * B; i < r; i++) // ...  BUG HERE
}
What's wrong?

The right partial block loop uses i < r but should use i <= r. The range is [l, r] inclusive, so position r itself must be included. This off-by-one means the rightmost element of every query is silently excluded -- producing answers that are slightly wrong.

Drill 5: Block size when n is small.

cpp
int B = sqrt(n);
int num_blocks = (n + B - 1) / B;
vector<long long> block_sum(num_blocks, 0);
for (int i = 0; i < n; i++)
    block_sum[i / B] += a[i];
What's wrong?

When n = 0, B = sqrt(0) = 0, and i / B causes division by zero. Even for small n like 1-3, sqrt(n) might return 1, which works but leads to degenerate behavior. Always guard: int B = max(1, (int)sqrt(n));. Also, for n = 0, the loop doesn't execute, but num_blocks computes (0 + 0 - 1) / 0 -- another division by zero.

Drill 6: Lazy propagation push-down on last block.

cpp
void push(int b) {
    if (lazy[b] != 0) {
        for (int i = b * B; i < (b + 1) * B; i++)
            a[i] += lazy[b];
        lazy[b] = 0;
    }
}
// Problem: the last block may have fewer than B elements
What's wrong?

The loop i < (b + 1) * B goes out of bounds for the last block when n is not a multiple of B. The last block has only n - b * B elements. Fix:

cpp
for (int i = b * B; i < min((b + 1) * B, n); i++)

Without this, you write past the array end, causing undefined behavior -- which might silently corrupt other data rather than crashing.

Debug checklist

  1. Print block sums after building -- verify they match naive sums.
  2. Test query where l and r are in the same block.
  3. Test query where l=0 and r=n1 (full range).
  4. Test with n=1 (single element, single block).
  5. After point update, verify the block sum changed correctly.

Self-Test

Without opening the implementation above, verify you can answer each item:

  • [ ] Write the 3-part range query (left partial + full blocks + right partial) from memory, including the same-block special case.
  • [ ] Given a point update a[i] = v, explain how and why you update the block sum in O(1) for sum aggregates vs. O(B) for min/max aggregates.
  • [ ] Derive the optimal block size B when partial-block processing costs O(B2) instead of O(B). (Hint: minimize B2+n/B.)
  • [ ] Name a problem category where sqrt decomposition works but segment trees struggle (non-mergeable operations, arbitrary reordering, etc.).
  • [ ] Explain the "same block" edge case: why does it need separate handling, and what goes wrong if you treat l and r as belonging to two separate partial blocks?
  • [ ] Describe how lazy tags work for range updates on full blocks, and why applying element-by-element to full blocks ruins the complexity.
  • [ ] Trace through a range query [2,13] on a 16-element array with B=4, identifying which parts are partial and which are full blocks.

Practice Problems

Rating Progression

CF RatingWhat to Master
1200Basic block decomposition: precompute block sums, answer range sum queries by combining full blocks + partial tails. Understand blocks: array of size n split into chunks of size ~√n.
1500Lazy block updates: maintain a per-block "add" tag and apply it during queries on partial blocks. Implement range sum with point updates using sqrt blocks; handle partial blocks at boundaries.
1800Sorted blocks for order-statistic queries; rebuilding a single block in O(√n) after an update; threshold tricks (brute-force small, DS for large). Lazy propagation within blocks, the threshold trick (values appearing > √n times), Mo's algorithm.
2100Sqrt on queries (Mo's algorithm foundations); batched sqrt-decomposition on time for offline updates; combining sqrt blocks with other DS inside blocks. Sqrt of sqrt (block-of-blocks), sqrt decomposition on queries.
#ProblemSourceDifficultyKey Idea
1Static Range Sum QueriesCSES1100Warmup -- prefix sums suffice, but practice block decomp as exercise
2Dynamic Range Sum QueriesCSES1400Range sum + point update -- direct sqrt decomp application
3Circular RMQCF 52C1600Range add + range min on circular array, lazy blocks
4XOR on SegmentCF 242E1700Range XOR update + range sum query with sqrt or segment tree
5Powerful ArrayCF 86D1800Mo's algorithm -- the classic sqrt offline problem
6Tree and QueriesCF 375D1800Mo's on trees with Euler tour + threshold trick
7XOR and Favorite NumberCF 617E1900Mo's algorithm with XOR prefix sums
8HolesCF 13E2000Sqrt decomp with jump pointers per block
9Serega and FunCF 455D2100Sqrt decomp with cyclic shifts inside blocks
10DestinyCF 840D2100Sqrt threshold trick -- heavy/light value decomposition

Further Reading

Built for the climb from Codeforces 1100 to 2100.