Appearance
FFT & NTT (Polynomial Multiplication)
Quick Revisit
- USE WHEN: polynomial multiplication and sum convolution (counting pairs with sum = k). For XOR convolution use FWHT (Walsh-Hadamard transform), not FFT.
- INVARIANT: evaluate at n points, multiply pointwise, interpolate back -- convolution theorem
- TIME: O(n log n) | SPACE: O(n)
- CLASSIC BUG: array size not padded to next power of 2 -- causes out-of-bounds or wrong results
- PRACTICE: ../12-practice-ladders/08-ladder-mixed.md
Difficulty: [Advanced] CF 1900-2200+ Prerequisites: Math Fundamentals, Number Theory, Divide and ConquerMotivation: When you need to multiply polynomials or compute convolutions in
Contest Frequency: [*] Specialized -- appears in polynomial/convolution problems
Contents
- Intuition
- How It Works (Detailed Mechanics)
- Implementation
- Patterns & Applications
- Complexity Analysis
- Edge Cases & Gotchas
- Debugging Checklist
- Worked Examples
- Self-Test
- Practice Problems
- Summary
FFT vs NTT — pick before you implement
The two transforms share the same convolution-theorem skeleton but differ in arithmetic:
| FFT | NTT | |
|---|---|---|
| Field | Complex numbers (double / long double) | |
| "Roots of unity" | A primitive | |
| Exact? | No — accumulates floating-point rounding error | Yes — every operation is an integer mod |
| Risk | Loss of precision when coefficients × n is large | Modulus must support |
| Use when | Real-valued / complex coefficients, or output fits comfortably in double precision | Answer is required modulo a specific prime (typically |
Decision rule: if the problem asks for the answer modulo a fixed NTT-friendly prime (especially
Intuition
The Pain
Multiply
The naive approach: every coefficient of
1 2 3 (coefficients of A)
x 4 5 (coefficients of B)
-----------
5 10 15 (multiply by 5, shift by 1)
+ 4 8 12 (multiply by 4, shift by 0)
-----------
4 13 22 15 => C(x) = 4 + 13x + 22x^2 + 15x^3This is exactly like long multiplication. For degree-
The wall: Any approach that touches every pair of coefficients is
The Key Insight
"Evaluate both polynomials at
Why does this work? A polynomial of degree
Coefficient repr. Point-value repr. Coefficient repr.
A = [a0..an-1] ----> A(w0), A(w1), ..., A(wn-1)
B = [b0..bn-1] ----> B(w0), B(w1), ..., B(wn-1)
|
multiply pointwise O(n)
|
C(wi) = A(wi) * B(wi)
|
C = [c0..c2n-2] <---- interpolate back (inverse FFT)The "special points" are the
Visual Walkthrough
Step 1: Evaluate (FFT forward)
Split polynomial into even and odd indexed coefficients:
For
Then
Since
Butterfly diagram (n=8, showing one level):
Input: a0 a1 a2 a3 a4 a5 a6 a7
| | | |
+--+-+--+ split even/odd
| | | |
Even: a0 a2 a4 a6 Odd: a1 a3 a5 a7
\ / \ /
FFT(n/2) FFT(n/2)
| |
E0 E1 E2 E3 O0 O1 O2 O3
\ | | / \ | | /
combine with twiddle factors w^k:
y[k] = E[k] + w^k * O[k] (k = 0..n/2-1)
y[k+n/2] = E[k] - w^k * O[k] (k = 0..n/2-1)Full butterfly diagram (n=8) -- all levels of the iterative FFT:
FFT BUTTERFLY NETWORK (n=8, iterative, bottom-up)
Inputs in bit-reversed order on the left; output = DFT values on the right.
Level 0 (input) Level 1 (2-pt DFTs) Level 2 (4-pt DFTs) Level 3 (8-pt DFT)
bit-reversed idx combine pairs combine quads combine all 8
================== ==================== ==================== ==================
a0 (000) ----*------+----- X0' ---*------+------ X0'' --*------+------ Y0
| w^0 | | w^0 | | w^0 |
a4 (100) ----*------+----- X1' ---+------+------ X1'' ---+------+------ Y1
| w^1 | | w^1 |
a2 (010) ----*------+----- X2' ---*------+------ X2'' ---+------+------ Y2
| w^0 | | w^2 |
a6 (110) ----*------+----- X3' ---+------+------ X3'' ---+------+------ Y3
| | | w^3 |
a1 (001) ----*------+----- X4' ---*------+------ X4'' --*------+------ Y4
| w^0 | | w^0 | | |
a5 (101) ----*------+----- X5' ---+------+------ X5'' ---+------+------ Y5
| w^1 | | |
a3 (011) ----*------+----- X6' ---*------+------ X6'' ---+------+------ Y6
| w^0 | | |
a7 (111) ----*------+----- X7' ---+------+------ X7'' ---+------+------ Y7
Each ---*---+--- is a butterfly: top = u + w^k * v
bot = u - w^k * v
Level 1: stride=1, twiddle w^0 (all trivial +/- pairs)
Level 2: stride=2, twiddles w^0, w^1 (w = w_4 = e^{2*pi*i/4} = i)
Level 3: stride=4, twiddles w^0, w^1, w^2, w^3 (w = w_8 = e^{2*pi*i/8})8th roots of unity on the complex plane:
Im
^
w^2 | w^1
(-0.7+0.7i)| (0.7+0.7i)
* | *
\ | /
w^3 \ | / w^0
(-1+0i) *-----\-+-/-------* (1+0i) --> Re
/ | \
/ | \
* | *
(-0.7-0.7i)| (0.7-0.7i)
w^4 | w^7
(= -1) |
* | *
w^5 | w^6
(-0.7-0.7i)| (0.7-0.7i)
w_8 = e^{2*pi*i/8}, w^k = (cos(2*pi*k/8), sin(2*pi*k/8))
Key identity: (w^k)^2 = w^{2k} maps 8th roots to 4th roots
w^0=1 w^1=(1+i)/sqrt2 w^2=i w^3=(-1+i)/sqrt2
w^4=-1 w^5=(-1-i)/sqrt2 w^6=-i w^7=(1-i)/sqrt2
The HALVING PROPERTY: squaring w_8^k gives w_4^k
w_8^0 -> w_4^0, w_8^1 -> w_4^1, w_8^2 -> w_4^2, w_8^3 -> w_4^3
w_8^4 -> w_4^0, w_8^5 -> w_4^1, w_8^6 -> w_4^2, w_8^7 -> w_4^3
This is WHY the even/odd split works: both halves evaluate at SAME points!The Convolution Theorem -- why FFT makes polynomial multiplication fast:
TIME DOMAIN FREQUENCY DOMAIN
(coefficient representation) (point-value representation)
+---------------------------+ +---------------------------+
| A = [a0, a1, ..., an-1] | --FFT--> | A^ = [A(w^0), ..., A(w^{n-1})] |
| B = [b0, b1, ..., bn-1] | --FFT--> | B^ = [B(w^0), ..., B(w^{n-1})] |
+---------------------------+ +---------------------------+
| |
| Naive multiply: | Pointwise multiply:
| O(n^2) | O(n)
| |
v v
+---------------------------+ +---------------------------+
| C = A * B | <--IFFT-- | C^ = A^ . B^ |
| (convolution, O(n^2)) | | C^[k] = A^[k] * B^[k] |
+---------------------------+ +---------------------------+
Total cost: FFT(A) + FFT(B) + pointwise + IFFT(C)
= O(n log n) + O(n log n) + O(n) + O(n log n)
= O(n log n)
THE KEY: convolution in time domain = pointwise product in freq domain.
FFT converts between the two in O(n log n).Step 2: Pointwise multiply -- trivially
Step 3: Interpolate (inverse FFT) -- same algorithm, but use
The Invariant
"At each recursion level, the polynomial is split into even and odd coefficients, evaluated at roots of unity of half the order."
More precisely: at recursion depth
Why
When to Reach for This
Trigger phrases in problem statements:
| Signal | Translation |
|---|---|
| "polynomial multiplication" | Direct FFT/NTT |
| "convolution" / "convolve" | FFT/NTT |
| "count ways with sum constraints" | Generating function + FFT |
| " | Probably needs FFT |
| "number of pairs with | Frequency array convolution |
| "string matching with wildcards" | Convolution trick |
| "XOR/OR/AND convolution" | Bitwise convolution variant |
| "answer modulo | NTT-friendly prime, strong hint |
Quick decision:
- Floating point OK, moderate
? Use FFT (doubles). - Need exact answers mod a prime? Use NTT.
has primitive root and supports NTT up to .
Insight Gaps
Recognizing when a problem needs convolution. The hardest part of FFT problems is not implementing FFT -- it's seeing that the problem is a convolution in disguise. Triggers:
- "Count pairs
with for each " --> self-convolution of frequency array - "Multiply two polynomials" --> direct application
- "Count ways to roll
dice summing to " --> multiply generating functions - "String matching with wildcards" --> cross-correlation via FFT
- "XOR convolution" --> Walsh-Hadamard transform (different algorithm, same structure)
WHT for XOR/OR/AND convolutions. When the "index combination" operation is XOR/OR/AND instead of addition, standard FFT doesn't apply. The Walsh-Hadamard Transform (WHT) handles XOR convolution; sum-over-subsets (SOS/zeta transform) handles OR/AND. These are different transforms from FFT but follow the same "forward transform, pointwise operation, inverse transform" structure.
CONVOLUTION TYPE --> ALGORITHM DECISION:
C[k] = sum_{i + j = k} A[i] * B[j] --> FFT / NTT
C[k] = sum_{i XOR j = k} A[i] * B[j] --> WHT (Walsh-Hadamard)
C[k] = sum_{i OR j = k} A[i] * B[j] --> Zeta (sum-over-subsets)
C[k] = sum_{i AND j = k} A[i] * B[j] --> Zeta (superset-sum)
All follow the same recipe:
forward_transform(A), forward_transform(B)
pointwise multiply
inverse_transform(result)
Only the transform function changes.The bit-reversal permutation. The iterative FFT reorders input by bit-reversal of indices before the butterfly passes. Understanding why (the recursive splitting by even/odd indices produces a specific permutation, and bit-reversal is its inverse) is the difference between blindly using the implementation and being able to debug it.
What the Code Won't Teach You
The FFT code converts between coefficient representation and point-value representation. What it hides is why this is useful: pointwise multiplication is
THE FFT RECIPE -- what the code actually does:
Coefficient form Point-value form
┌─────────────┐ FFT ┌─────────────┐
│ A = [a₀..aₙ]│──────────->│ A(ω⁰)..A(ωⁿ)│
└─────────────┘ └──────┬──────┘
│ pointwise
┌─────────────┐ FFT ┌──────┴──────┐ multiply
│ B = [b₀..bₙ]│──────────->│ B(ω⁰)..B(ωⁿ)│──-> C(ωᵏ)=A(ωᵏ)*B(ωᵏ)
└─────────────┘ └─────────────┘
│
┌─────────────┐ inv FFT ┌─────────────┐ │
│ C = [c₀..cₙ]│<-──────────│ C(ω⁰)..C(ωⁿ)│<-──────┘
└─────────────┘ └─────────────┘
Cost: O(n log n) + O(n) + O(n log n) = O(n log n)
vs. naive coefficient multiply: O(n²)The code also won't teach you the evaluation-interpolation duality. FFT evaluates a polynomial at
BIT-REVERSAL PERMUTATION for n=8:
Index (decimal): 0 1 2 3 4 5 6 7
Index (binary): 000 001 010 011 100 101 110 111
Reversed: 000 100 010 110 001 101 011 111
New position: 0 4 2 6 1 5 3 7
Input array: [a₀ a₁ a₂ a₃ a₄ a₅ a₆ a₇]
After perm: [a₀ a₄ a₂ a₆ a₁ a₅ a₃ a₇]
This reordering makes the butterfly stages work in-place:
Stage 1: pairs (0,1)(2,3)(4,5)(6,7) -- stride 1
Stage 2: pairs (0,2)(1,3)(4,6)(5,7) -- stride 2
Stage 3: pairs (0,4)(1,5)(2,6)(3,7) -- stride 4Finally, when precision fails (coefficients exceed ~llround. The NTT avoids this entirely by working in exact modular arithmetic, but only for NTT-friendly primes.
The code also won't teach you the deep connection between FFT and NTT: both are instances of the Discrete Fourier Transform over different algebraic structures. FFT works over
WHY 998244353 IS THE CANONICAL NTT PRIME:
998244353 = 119 x 2²³ + 1
Key properties:
┌─────────────────────────────────────────────────┐
│ 1. It's prime -> modular inverse │
│ 2. p-1 = 119 x 2²³ -> 2²³ divides p-1 │
│ 3. Max NTT size = 2²³ -> ~8 million elements │
│ 4. Primitive root g = 3 -> verified, standard │
└─────────────────────────────────────────────────┘
Compare with 10⁹+7 = 1000000007:
┌─────────────────────────────────────────────────┐
│ 10⁹+7 - 1 = 2 x 500000003 │
│ Largest power of 2 dividing (p-1) = 2¹ │
│ Max NTT size = 2 <- USELESS for NTT! │
└─────────────────────────────────────────────────┘
When you need mod 10⁹+7: use split-coefficient FFT
(3-4 real FFTs on 15-bit halves) or multi-prime NTT
with CRT reconstruction.BUTTERFLY DIAGRAM for n=8 FFT:
a₀ ──┬──────────┬──────────┬──── X₀
a₄ ──┤ ω⁰ ┤ ω⁰ ┤
├──────────┤ │
a₂ ──┤ ├──────────┤──── X₁
a₆ ──┤ ω⁰ │ ω¹ │
├──────────┤ │
a₁ ──┤ ├──────────┤──── X₂
a₅ ──┤ ω⁰ │ ω² │
├──────────┤ │
a₃ ──┤ ├──────────┤──── X₃
a₇ ──┘ ω⁰ ┘ ω³ ┘
Each stage: combine pairs with twiddle factor ωᵏ
3 stages for n=8 (log₂8 = 3)The connection between FFT and NTT: both are DFT, just over different rings. FFT computes the Discrete Fourier Transform over complex<double> with long long and replacing
The evaluation-interpolation duality is the soul of FFT. The FFT code performs two operations: evaluate a polynomial at
Why
FFT vs NTT -- SAME ALGORITHM, DIFFERENT RINGS:
┌──────────────────────┬───────────────────────────────┐
│ FFT │ NTT │
├──────────────────────┼───────────────────────────────┤
│ Ring: ℂ (complex) │ Ring: ℤ/pℤ (mod prime p) │
│ Root: ω = e^(2πi/n) │ Root: ω = g^((p-1)/n) mod p │
│ Inverse: conjugate │ Inverse: modular inverse │
│ Scaling: ÷ n │ Scaling: x n⁻¹ mod p │
│ Precision: <= 10¹⁵ │ Precision: exact (mod p) │
├──────────────────────┴───────────────────────────────┤
│ Butterfly structure, bit-reversal, O(n log n) │
│ -- ALL identical between FFT and NTT │
└──────────────────────────────────────────────────────┘
WHY 998244353?
998244353 = 119 x 2²³ + 1
^^^
max NTT size = 2²³ = 8,388,608
primitive root g = 3
Compare:
10⁹ + 7 = 500000003 x 2¹ + 1 -> max NTT size = 2 (useless!)
985661441 = 235 x 2²² + 1 -> max NTT size = 2²² = 4,194,304
469762049 = 7 x 2²⁶ + 1 -> max NTT size = 2²⁶ (even bigger!)How It Works (Detailed Mechanics)
Roots of Unity Primer
The
Key properties:
- Periodicity:
- Halving:
(the crucial one for FFT) - Cancellation:
, so - Summation:
for
Property 3 gives us the "butterfly": we compute
Cooley-Tukey (Recursive FFT)
FFT(a[0..n-1], w):
if n == 1: return a
even = FFT(a[0], a[2], ..., a[n-2], w^2)
odd = FFT(a[1], a[3], ..., a[n-1], w^2)
t = 1
for k = 0 to n/2 - 1:
a[k] = even[k] + t * odd[k]
a[k + n/2] = even[k] - t * odd[k]
t = t * w
return aIterative FFT (Bit-Reversal)
The recursive version has overhead from memory allocation. The iterative version works in-place by first permuting the array according to bit-reversal of indices, then building up the butterfly from small to large.
Bit-reversal example (
Index: 0 1 2 3 4 5 6 7
Binary: 000 001 010 011 100 101 110 111
Reverse: 000 100 010 110 001 101 011 111
Decimal: 0 4 2 6 1 5 3 7After bit-reversal, the iterative butterflies naturally combine adjacent pairs, then groups of 4, then 8, etc.
FFT vs NTT -- SAME BUTTERFLY, DIFFERENT ARITHMETIC:
FFT butterfly (complex): NTT butterfly (modular):
u = a[j] u = a[j]
v = a[j + half] * w v = a[j + half] * w % MOD
a[j] = u + v a[j] = (u + v) % MOD
a[j + half] = u - v a[j + half] = (u - v + MOD) % MOD
w *= wn w = w * wn % MOD
┌─────────────────┬──────────────────────────┐
│ FFT │ NTT │
├─────────────────┼──────────────────────────┤
│ w = e^(2πi/n) │ w = g^((p-1)/n) mod p │
│ complex mul │ modular mul │
│ ÷ n at end │ x n⁻¹ mod p at end │
│ precision ~10¹⁵ │ exact (mod p) │
│ uses sin/cos │ uses modpow │
└─────────────────┴──────────────────────────┘NTT: Modular Arithmetic Version
Replace complex roots of unity with primitive roots modulo a prime.
For prime
Common NTT-friendly primes:
Prime p-1 factorization Max NTT size Prim. root
998244353 119 * 2^23 2^23 = 8388608 3
985661441 235 * 2^22 2^22 = 4194304 3
469762049 7 * 2^26 2^26 = 67108864 3
167772161 5 * 2^25 2^25 = 33554432 3Inverse NTT: Same as NTT but with
Implementation
Recursive FFT (Educational)
cpp
#include <bits/stdc++.h>
using namespace std;
using cd = complex<double>;
const double PI = acos(-1.0);
void fft_recursive(vector<cd>& a, bool invert) {
int n = a.size();
if (n == 1) return;
vector<cd> even(n / 2), odd(n / 2);
for (int i = 0; i < n / 2; i++) {
even[i] = a[2 * i];
odd[i] = a[2 * i + 1];
}
fft_recursive(even, invert);
fft_recursive(odd, invert);
double ang = 2 * PI / n * (invert ? -1 : 1);
cd w(1), wn(cos(ang), sin(ang));
for (int i = 0; i < n / 2; i++) {
a[i] = even[i] + w * odd[i];
a[i + n / 2] = even[i] - w * odd[i];
w *= wn;
}
}Iterative FFT (Contest-Ready, Doubles)
cpp
#include <bits/stdc++.h>
using namespace std;
using cd = complex<double>;
const double PI = acos(-1.0);
void fft(vector<cd>& a, bool invert) {
int n = a.size();
// Bit-reversal permutation
for (int i = 1, j = 0; i < n; i++) {
int bit = n >> 1;
for (; j & bit; bit >>= 1)
j ^= bit;
j ^= bit;
if (i < j) swap(a[i], a[j]);
}
// Butterfly computation
for (int len = 2; len <= n; len <<= 1) {
double ang = 2 * PI / len * (invert ? -1 : 1);
cd wlen(cos(ang), sin(ang));
for (int i = 0; i < n; i += len) {
cd w(1);
for (int j = 0; j < len / 2; j++) {
cd u = a[i + j];
cd v = a[i + j + len / 2] * w;
a[i + j] = u + v;
a[i + j + len / 2] = u - v;
w *= wlen;
}
}
}
if (invert) {
for (auto& x : a) x /= n;
}
}
// Multiply two polynomials given as coefficient vectors.
// Returns coefficient vector of the product.
vector<long long> multiply(vector<int>& a, vector<int>& b) {
vector<cd> fa(a.begin(), a.end()), fb(b.begin(), b.end());
int n = 1;
while (n < (int)(a.size() + b.size() - 1)) n <<= 1;
fa.resize(n);
fb.resize(n);
fft(fa, false);
fft(fb, false);
for (int i = 0; i < n; i++) fa[i] *= fb[i];
fft(fa, true);
vector<long long> res(n);
for (int i = 0; i < n; i++) res[i] = llround(fa[i].real());
return res;
}Precision note: With double, reliable for coefficients up to about
NTT (Contest-Ready, mod 998244353)
cpp
#include <bits/stdc++.h>
using namespace std;
const int MOD = 998244353;
const int PRIM_ROOT = 3;
long long power(long long a, long long b, long long mod) {
long long res = 1;
a %= mod;
while (b > 0) {
if (b & 1) res = res * a % mod;
a = a * a % mod;
b >>= 1;
}
return res;
}
long long inv(long long a, long long mod) {
return power(a, mod - 2, mod);
}
void ntt(vector<long long>& a, bool invert) {
int n = a.size();
// Bit-reversal permutation
for (int i = 1, j = 0; i < n; i++) {
int bit = n >> 1;
for (; j & bit; bit >>= 1)
j ^= bit;
j ^= bit;
if (i < j) swap(a[i], a[j]);
}
// Butterfly computation
for (int len = 2; len <= n; len <<= 1) {
long long w = invert ? inv(PRIM_ROOT, MOD) : PRIM_ROOT;
w = power(w, (MOD - 1) / len, MOD);
for (int i = 0; i < n; i += len) {
long long wn = 1;
for (int j = 0; j < len / 2; j++) {
long long u = a[i + j];
long long v = a[i + j + len / 2] * wn % MOD;
a[i + j] = (u + v) % MOD;
a[i + j + len / 2] = (u - v + MOD) % MOD;
wn = wn * w % MOD;
}
}
}
if (invert) {
long long n_inv = inv(n, MOD);
for (auto& x : a) x = x * n_inv % MOD;
}
}
// Multiply two polynomials modulo MOD.
vector<long long> multiply_ntt(vector<long long>& a, vector<long long>& b) {
vector<long long> fa(a), fb(b);
int n = 1;
while (n < (int)(a.size() + b.size() - 1)) n <<= 1;
fa.resize(n);
fb.resize(n);
ntt(fa, false);
ntt(fb, false);
for (int i = 0; i < n; i++) fa[i] = fa[i] * fb[i] % MOD;
ntt(fa, true);
fa.resize(a.size() + b.size() - 1);
return fa;
}Convolution Wrapper (Frequency Array)
A very common pattern: given arrays of counts, compute the "sum convolution."
cpp
// Given freq[v] = number of elements with value v,
// compute conv[s] = number of pairs (i, j) with a[i] + a[j] = s.
vector<long long> sum_convolution(vector<int>& freq) {
// Self-convolution via FFT
vector<cd> fa(freq.begin(), freq.end());
int n = 1;
while (n < (int)(2 * freq.size() - 1)) n <<= 1;
fa.resize(n);
fft(fa, false);
for (int i = 0; i < n; i++) fa[i] *= fa[i];
fft(fa, true);
vector<long long> res(n);
for (int i = 0; i < n; i++) res[i] = llround(fa[i].real());
return res;
}Arbitrary Modulus Convolution (Split Trick)
When the modulus is not NTT-friendly, split coefficients into high and low 15-bit halves and use 3 FFTs:
cpp
// Multiply polynomials mod any prime using FFT with doubles.
// Split each coefficient a[i] = a_hi[i] * BASE + a_lo[i].
// Then a*b = (a_hi*b_hi)*BASE^2 + (a_hi*b_lo + a_lo*b_hi)*BASE + a_lo*b_lo
// Each sub-product is computed via FFT on doubles (fits in precision).
vector<long long> multiply_mod(vector<int>& a, vector<int>& b, long long mod) {
const int BASE = 1 << 15; // 32768
int need = a.size() + b.size() - 1;
int n = 1;
while (n < need) n <<= 1;
vector<cd> a_lo(n), a_hi(n), b_lo(n), b_hi(n);
for (int i = 0; i < (int)a.size(); i++) {
a_lo[i] = a[i] % BASE;
a_hi[i] = a[i] / BASE;
}
for (int i = 0; i < (int)b.size(); i++) {
b_lo[i] = b[i] % BASE;
b_hi[i] = b[i] / BASE;
}
fft(a_lo, false); fft(a_hi, false);
fft(b_lo, false); fft(b_hi, false);
// Combine: 4 pointwise products
vector<cd> hh(n), hl(n), lh(n), ll(n);
for (int i = 0; i < n; i++) {
hh[i] = a_hi[i] * b_hi[i];
hl[i] = a_hi[i] * b_lo[i];
lh[i] = a_lo[i] * b_hi[i];
ll[i] = a_lo[i] * b_lo[i];
}
fft(hh, true); fft(hl, true); fft(lh, true); fft(ll, true);
vector<long long> res(need);
for (int i = 0; i < need; i++) {
long long v_hh = llround(hh[i].real()) % mod;
long long v_hl = llround(hl[i].real()) % mod;
long long v_lh = llround(lh[i].real()) % mod;
long long v_ll = llround(ll[i].real()) % mod;
res[i] = ((v_hh % mod * BASE % mod * BASE % mod)
+ (v_hl + v_lh) % mod * BASE % mod
+ v_ll) % mod;
res[i] = (res[i] % mod + mod) % mod;
}
return res;
}Patterns & Applications
Pattern 1: Direct Polynomial Multiplication
When: The problem literally asks you to multiply two polynomials or compute a convolution of two sequences.
Given two polynomials A and B, compute C = A * B.
Solution: Direct FFT/NTT multiply.
Time: O(n log n)Pattern 2: Counting with Generating Functions
When: "Count the number of ways to pick items whose values sum to
Idea: Create a generating function where the coefficient of
Example: You have
cpp
// Each die contributes: x^1 + x^2 + x^3 + x^4 + x^5 + x^6
// For n dice, multiply this polynomial with itself n times.
// Use repeated squaring: O(n * S * log n) via FFT, where S = 6n.
vector<long long> dice_ways(int n) {
// Single die polynomial: coefficients for x^1..x^6
vector<long long> single = {0, 1, 1, 1, 1, 1, 1};
vector<long long> result = {1}; // identity: x^0
// Binary exponentiation on polynomials
auto poly = single;
int exp = n;
while (exp > 0) {
if (exp & 1) result = multiply_ntt(result, poly);
poly = multiply_ntt(poly, poly);
exp >>= 1;
}
return result; // result[s] = number of ways to get sum s
}CONVOLUTION TYPE -> ALGORITHM DECISION TREE:
What operation combines indices?
│
├─ i + j = k (additive)
│ ├─ Need exact mod p?
│ │ ├─ p is NTT-friendly (p = c*2ᵏ+1, k large) -> NTT mod p
│ │ ├─ p = 10⁹+7 (not NTT-friendly) -> split-coefficient FFT (7 FFTs)
│ │ └─ arbitrary p -> CRT with 2-3 NTT-friendly primes
│ └─ No modulus / small values -> FFT on doubles + llround
│
├─ i OR j = k -> SOS (zeta/Möbius) transform
├─ i AND j = k -> superset-sum transform
├─ i XOR j = k -> Walsh-Hadamard Transform (WHT)
│
└─ i * j = k (multiplicative) -> Dirichlet convolution
(no fast transform in general; use factorization tricks)CHOOSING YOUR FFT/NTT VARIANT -- QUICK REFERENCE:
┌─────────────────────────────────────────────────────────────┐
│ START HERE │
│ │ │
│ Need modular arithmetic? │
│ ╱ ╲ │
│ YES NO │
│ │ │ │
│ What modulus? Values small │
│ ╱ │ ╲ enough for │
│ NTT- 10⁹+7 Other doubles? │
│ friendly mod ╱ ╲ │
│ │ │ │ YES NO │
│ ↓ ↓ ↓ ↓ ↓ │
│ NTT Split CRT FFT Multi- │
│ mod p coeff with with prime │
│ FFT 2-3 llround NTT │
│ (7 NTT + CRT │
│ FFTs) primes │
└─────────────────────────────────────────────────────────────┘
Decision factors:
┌──────────────────┬────────────┬──────────────────────┐
│ Variant │ Speed │ When to use │
├──────────────────┼────────────┼──────────────────────┤
│ NTT mod p │ Fastest │ p is NTT-friendly │
│ FFT doubles │ Fast │ n*v² < 10¹⁵ │
│ Split-coeff FFT │ ~7x FFT │ mod 10⁹+7 │
│ Multi-prime NTT │ ~3x NTT │ arbitrary modulus │
└──────────────────┴────────────┴──────────────────────┘Pattern 3: Frequency Convolution (Pair Counting)
When: "How many pairs
Build a frequency array freq[v] and self-convolve. The result at index
cpp
// freq[v] = count of elements equal to v
// After convolution: res[k] = sum over all v of freq[v] * freq[k-v]
// = number of ordered pairs summing to kVariations:
- Difference instead of sum: reverse one array, then convolve.
- Product: take logarithms to convert to sums (with care for precision).
FREQUENCY SELF-CONVOLUTION -- counting pair sums:
Array: [1, 3, 2, 3, 5]
Frequency: freq[1]=1, freq[2]=1, freq[3]=2, freq[5]=1
Convolution C = freq ⊛ freq:
C[k] = Σ freq[i] * freq[k-i]
C[2] = freq[1]*freq[1] = 1 (pair: 1+1)
C[3] = freq[1]*freq[2] + freq[2]*freq[1] = 2 (pairs: 1+2, 2+1)
C[4] = freq[1]*freq[3] + freq[2]*freq[2] + freq[3]*freq[1] = 5
C[6] = freq[1]*freq[5] + freq[3]*freq[3] + freq[5]*freq[1] = 6
Value: 0 1 2 3 4 5 6 7 8 ...
freq: 0 1 1 2 0 1 0 0 0
conv: 0 0 1 2 5 4 6 4 1 ...
^ ^
| |
1+1=2 (1 way) 1+3=4, 2+2=4, 3+1=4 (5 ordered ways)Pattern 4: String Matching via Convolution
When: "Find all positions where pattern
Encode characters as numbers. A match at position
Expand the squared term and compute each part via convolution (reversing
cpp
// Simplified: wildcard = 0 in both T and P
// Match score at position s:
// sum_j T[s+j]^2 * P[j] - 2 * T[s+j] * P[j]^2 + P[j]^3 ... (expanded)
// Each sum is a convolution after reversing P.Pattern 5: Bitwise Convolution (OR/AND/XOR)
Standard convolution computes
OR convolution:
XOR convolution:
cpp
// Walsh-Hadamard Transform (XOR convolution)
void wht(vector<long long>& a, bool invert) {
int n = a.size();
for (int len = 1; len < n; len <<= 1) {
for (int i = 0; i < n; i += len << 1) {
for (int j = 0; j < len; j++) {
long long u = a[i + j];
long long v = a[i + j + len];
a[i + j] = (u + v) % MOD;
a[i + j + len] = (u - v + MOD) % MOD;
}
}
}
if (invert) {
long long inv_n = inv(n, MOD);
for (auto& x : a) x = x * inv_n % MOD;
}
}
// OR convolution (subset-sum / zeta transform)
void or_transform(vector<long long>& a, bool invert) {
int n = a.size();
for (int len = 1; len < n; len <<= 1) {
for (int i = 0; i < n; i += len << 1) {
for (int j = 0; j < len; j++) {
if (!invert)
a[i + j + len] = (a[i + j + len] + a[i + j]) % MOD;
else
a[i + j + len] = (a[i + j + len] - a[i + j] + MOD) % MOD;
}
}
}
}
// AND convolution (superset-sum / zeta transform)
void and_transform(vector<long long>& a, bool invert) {
int n = a.size();
for (int len = 1; len < n; len <<= 1) {
for (int i = 0; i < n; i += len << 1) {
for (int j = 0; j < len; j++) {
if (!invert)
a[i + j] = (a[i + j] + a[i + j + len]) % MOD;
else
a[i + j] = (a[i + j] - a[i + j + len] + MOD) % MOD;
}
}
}
}
// Usage: XOR convolution of two arrays
vector<long long> xor_convolve(vector<long long> a, vector<long long> b) {
int n = 1;
while (n < max(a.size(), b.size())) n <<= 1;
a.resize(n); b.resize(n);
wht(a, false); wht(b, false);
for (int i = 0; i < n; i++) a[i] = a[i] * b[i] % MOD;
wht(a, true);
return a;
}Pattern 6: Power of a Polynomial / Repeated Squaring
When: You need
Combine polynomial multiplication with binary exponentiation. Cost:
cpp
vector<long long> poly_power(vector<long long> base, int k, int max_deg) {
vector<long long> result = {1};
while (k > 0) {
if (k & 1) {
result = multiply_ntt(result, base);
if ((int)result.size() > max_deg + 1)
result.resize(max_deg + 1);
}
base = multiply_ntt(base, base);
if ((int)base.size() > max_deg + 1)
base.resize(max_deg + 1);
k >>= 1;
}
return result;
}Complexity Analysis
| Operation | Time | Space |
|---|---|---|
| FFT (forward or inverse) | ||
| NTT (forward or inverse) | ||
| Polynomial multiply | ||
| WHT / OR / AND transform | ||
| Arbitrary-mod multiply (split) | ||
| Poly power via binary exp |
Constant factors:
- FFT (doubles): ~4-5x slower than NTT due to floating point and
sin/cos. - NTT (mod 998244353): Fastest for modular problems.
- The split trick for arbitrary modulus uses ~7 FFTs total.
When
POLYNOMIAL SIZE -> FFT SIZE -> RESULT SIZE:
Input: A has |A| coefficients, B has |B| coefficients
Result: C = A*B has |A|+|B|-1 coefficients
FFT needs: n_fft = next power of 2 >= |A|+|B|-1
Example: |A|=3, |B|=4
A: [a₀ a₁ a₂] 3 coefficients
B: [b₀ b₁ b₂ b₃] 4 coefficients
C: [c₀ c₁ c₂ c₃ c₄ c₅] = 3+4-1 = 6 coefficients
FFT size = next_pow2(6) = 8
Padded A: [a₀ a₁ a₂ 0 0 0 0 0 ] (8 entries)
Padded B: [b₀ b₁ b₂ b₃ 0 0 0 0 ] (8 entries)
FFT out: [c₀ c₁ c₂ c₃ c₄ c₅ 0 0 ] (8 entries)
Result: [c₀ c₁ c₂ c₃ c₄ c₅] (take first 6)
Work done: 3 FFTs of size 8 = 3 x 8 x log₂(8) = 72 operations
If we padded to 16 instead: 3 x 16 x 4 = 192 operations (2.67x slower!)ROOTS OF UNITY ON THE UNIT CIRCLE (n=8):
ω¹ = e^(2πi/8)
*
ω² ╱ ╲ ω⁰ = 1
* *
╱ ╲
ω³ *───────────────* ω⁷ <- the 8 points where
╲ ╱ we evaluate the polynomial
* *
ω⁴ ╲ ╱ ω⁶
*
ω⁵
Key properties:
• ω⁸ = 1 (they're 8th roots of unity)
• ωᵏ⁺⁴ = -ωᵏ (half-circle symmetry -> butterfly trick)
• Sum of all roots = 0 (cancellation lemma)
• In NTT: replace unit circle with cyclic group mod p
ω = g^((p-1)/n) mod p, where g is a primitive rootRUNTIME SCALING -- WHY CONSTANTS MATTER IN O(n log n):
FFT size (n) log₂n n*log₂n Relative to 2¹⁷
──────────────────────────────────────────────────────
2¹⁵ = 32K 15 491K 0.14x
2¹⁶ = 64K 16 1049K 0.30x
2¹⁷ = 128K 17 2228K 1.00x <- baseline
2¹⁸ = 256K 18 4718K 2.12x
2¹⁹ = 512K 19 9961K 4.47x
2²⁰ = 1M 20 20972K 9.41x
2²¹ = 2M 21 44040K 19.77x
2²² = 4M 22 92274K 41.42x
2²³ = 8M 23 192937K 86.60x <- 998244353 max
Key takeaway: going from 2¹⁷ -> 2²⁰ is ~9x slower,
not just "a little more padding." Every extra bit in
the exponent roughly doubles the work.
Each FFT multiply does 3 transforms (FFT, FFT, IFFT),
so total work = 3 x n x log₂n butterfly operations.MEMORY LAYOUT -- CACHE EFFECTS IN FFT:
Small n (fits in L1/L2 cache):
┌──────────────────────────────────────┐
│ All butterfly operations hit cache │
│ -> near-peak throughput │
└──────────────────────────────────────┘
Large n (exceeds cache):
┌──────────────────────────────────────┐
│ Later stages stride > cache line │
│ -> cache misses on every access │
│ -> 3-5x slower per operation │
└──────────────────────────────────────┘
Practical consequence:
n = 2¹⁸ (256K complex doubles = 4 MB) ~= L3 boundary
n = 2²⁰ (1M complex doubles = 16 MB) -> cache-hostile
-> NTT (mod arithmetic, no doubles) uses half the memoryEdge Cases & Gotchas
Forgetting to pad to power of 2: The FFT/NTT size must be a power of 2 and at least
. Precision with doubles:
llroundinstead of(long long)to avoid off-by-one rounding errors. Coefficients abovemay lose precision. Modular inverse existence: For NTT,
(the FFT size) must be invertible mod . Since is prime and is a power of 2 that divides , this is always satisfied for NTT-friendly primes. Off-by-one in result size: Product of polynomials of sizes
and has size , not . NTT overflow: Intermediate products
can overflow long longif both are close to. Use __int128or ensure values are reduced modbefore multiplication (they should be if done correctly). Negative coefficients in NTT: Add
before taking mod to avoid negative values: (u - v + MOD) % MOD.Self-convolution double counting: When convolving a frequency array with itself to count pairs,
counts ordered pairs. For unordered pairs, handle and separately. Reusing arrays between test cases: Clear or resize properly. A common bug is leftover data from a previous FFT call.
Mental Traps
"FFT is just an efficient algorithm. I don't need to understand the math." You can use it as a black box -- until you need to modify it. Handling non-power-of-2 sizes, choosing between FFT and NTT, understanding why precision fails at large coefficient values, recognizing that the XOR convolution needs WHT not FFT -- all of these require knowing what the algorithm actually does. The "just copy-paste the template" approach breaks at the first non-standard application.
"If I'm working modulo some prime, use NTT." Only if the prime is NTT-friendly (
"Convolution and polynomial multiplication are different things." They're not. The convolution of arrays
"I need exact integer arithmetic, so doubles won't work." With llround, FFT on doubles gives exact integer results as long as the intermediate values don't exceed
"I'll just make the FFT size much bigger to be safe." Doubling
"The result polynomial has the same size as my input arrays." The product of polynomials of sizes
POLYNOMIAL SIZE ARITHMETIC:
A = [a₀, a₁, a₂] |A| = 3 (degree 2)
B = [b₀, b₁, b₂, b₃] |B| = 4 (degree 3)
Product C = A x B:
degree(C) = 2 + 3 = 5
|C| = 3 + 4 - 1 = 6 <- THIS is how many coefficients
FFT size needed:
next_power_of_2(6) = 8
Common mistakes:
X Reading |C| = max(3,4) = 4 -> misses c₄, c₅
X Reading |C| = 3 + 4 = 7 -> c₆ is garbage (always 0, but wasteful)
X FFT size = max(3,4) = 4 -> aliasing! Results wrap around
OK Reading |C| = 3 + 4 - 1 = 6 -> correct
OK FFT size = 8 -> correct (next power of 2 >= 6)"I'll just increase the FFT size to be safe." Doubling the FFT size doubles the work (
"The result array has the same size as the input." The product of a degree-
SIZE RELATIONSHIPS IN POLYNOMIAL MULTIPLICATION:
A has |A| coefficients (degree |A|-1)
B has |B| coefficients (degree |B|-1)
C = A*B has |A|+|B|-1 coefficients
Example: A = [1, 2, 3] (|A|=3)
B = [4, 5, 6, 7] (|B|=4)
Product size = 3 + 4 - 1 = 6 coefficients
FFT size = next power of 2 >= 6 = 8
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ c₀│ c₁│ c₂│ c₃│ c₄│ c₅│ 0 │ 0 │ <- FFT output (size 8)
└───┴───┴───┴───┴───┴───┴───┴───┘
├─── result: take first 6 ───┤ ├junk┤
Common bugs:
X result.resize(max(|A|,|B|)) = 4 -> loses c₄, c₅
X result.resize(n_fft) = 8 -> includes garbage zeros
OK result.resize(|A|+|B|-1) = 6 -> correctTrap: "Self-convolution counts each pair once."
Self-convolution of a frequency array counts ordered pairs. If freq = [0, 2, 1] (two 1s, one 2), then conv[2] = freq[1]*freq[1] = 4, which counts pairs (1,1), (1,1), (1,1), (1,1) -- all four ordered selections of two elements from {1, 1}. For unordered pairs, you need to subtract the
SELF-CONVOLUTION PAIR COUNTING:
Array: [1, 1, 2] freq[1]=2, freq[2]=1
Raw convolution result for sum k:
conv[2] = freq[0]*freq[2] + freq[1]*freq[1] + freq[2]*freq[0]
= 0*1 + 2*2 + 1*0 = 4 <- ORDERED pairs summing to 2
Ordered pairs summing to 2: (1,1) x 4 ways
(elem 0 + elem 1), (elem 1 + elem 0),
(elem 0 + elem 0), (elem 1 + elem 1) <- includes i=j!
Adjustments for UNORDERED DISTINCT pairs:
1. Subtract i=j: conv[2k] -= freq[k] (for each k)
2. Divide by 2: conv[s] /= 2
┌──────────────────────────────────────────────┐
│ Raw conv = ordered pairs (including i=j) │
│ Subtract diagonal -> ordered pairs (i!=j) │
│ Divide by 2 -> unordered pairs (i!=j) │
└──────────────────────────────────────────────┘Trap: "I can do NTT with any prime, as long as I have a primitive root."
You need a primitive root of order exactly
NTT PRIME COMPATIBILITY CHECK:
For NTT of size n (must be power of 2):
REQUIREMENT: n divides (p - 1)
p = 998244353: p-1 = 998244352 = 119 x 2²³
Max n = 2²³ = 8,388,608 OK (plenty for contests)
p = 10⁹ + 7: p-1 = 10⁹ + 6 = 2 x 500000003
Max n = 2¹ = 2 X (useless!)
p = 469762049: p-1 = 469762048 = 7 x 2²⁶
Max n = 2²⁶ = 67,108,864 OK (even bigger!)
┌───────────────────────────────────────────────┐
│ Before using NTT, ALWAYS verify: │
│ 1. p is prime │
│ 2. Find largest power of 2 dividing p-1 │
│ 3. That power >= your required FFT size │
│ 4. Know the primitive root (usually g=3) │
└───────────────────────────────────────────────┘Debugging Checklist
When your FFT/NTT solution gives wrong answers:
[ ] Is n (FFT size) a power of 2 and >= sum of input sizes?
[ ] Did you resize BOTH input arrays to n before FFT?
[ ] Is the result array read up to the correct index (a.size() + b.size() - 2)?
[ ] For FFT: are you using llround(), not (long long) cast?
[ ] For NTT: is the modulus NTT-friendly? Does 2^k divide (p-1)?
[ ] For NTT: is the primitive root correct for your modulus?
[ ] Is the inverse FFT dividing by n (or multiplying by n^{-1} mod p)?
[ ] For convolution: did you reverse the correct array?
[ ] Are intermediate multiplications overflowing?
[ ] For multiple test cases: are you clearing/resizing properly?Quick sanity test: Multiply
cpp
// Debug helper: test basic multiplication
void test_fft() {
vector<int> a = {1, 1}, b = {1, 1};
auto c = multiply(a, b);
// c should be [1, 2, 1]
assert(c[0] == 1 && c[1] == 2 && c[2] == 1);
}Worked Examples
Example 0: Step-by-Step FFT Multiply --
Goal: Compute
STEP 0: PAD TO n=4
A = [1, 2, 0, 0] B = [3, 1, 0, 0]
STEP 1: EVALUATE (forward FFT at 4th roots of unity)
The 4th roots of unity: w_4^0=1, w_4^1=i, w_4^2=-1, w_4^3=-i
A(w^0) = 1 + 2(1) + 0 + 0 = 3
A(w^1) = 1 + 2(i) + 0 + 0 = 1 + 2i
A(w^2) = 1 + 2(-1) + 0 + 0 = -1
A(w^3) = 1 + 2(-i) + 0 + 0 = 1 - 2i
B(w^0) = 3 + 1(1) + 0 + 0 = 4
B(w^1) = 3 + 1(i) + 0 + 0 = 3 + i
B(w^2) = 3 + 1(-1) + 0 + 0 = 2
B(w^3) = 3 + 1(-i) + 0 + 0 = 3 - i
A^ = [3, 1+2i, -1, 1-2i]
B^ = [4, 3+i, 2, 3-i ]
STEP 2: POINTWISE MULTIPLY
C^[0] = 3 * 4 = 12
C^[1] = (1+2i)(3+i) = 3+i+6i+2i^2 = 3+7i-2 = 1+7i
C^[2] = (-1) * 2 = -2
C^[3] = (1-2i)(3-i) = 3-i-6i+2i^2 = 3-7i-2 = 1-7i
C^ = [12, 1+7i, -2, 1-7i]
STEP 3: INTERPOLATE (inverse FFT = evaluate at w^{-1}, divide by n)
Use IFFT formula: c[k] = (1/n) * sum_{j=0}^{n-1} C^[j] * w^{-jk}
c[0] = (12 + (1+7i) + (-2) + (1-7i)) / 4 = 12/4 = 3
c[1] = (12 + (1+7i)(w^{-1}) + (-2)(w^{-2}) + (1-7i)(w^{-3})) / 4
= (12 + (1+7i)(-i) + (-2)(-1) + (1-7i)(i)) / 4
= (12 + (-i+7) + 2 + (i+7)) / 4 = 28/4 = 7
c[2] = (12 + (1+7i)(-1) + (-2)(1) + (1-7i)(-1)) / 4
= (12 - 1 - 7i - 2 - 1 + 7i) / 4 = 8/4 = 2
c[3] = 0 (as expected -- no x^3 term)
RESULT: C = [3, 7, 2] --> 3 + 7x + 2x^2 OKExample 1: Big Number Multiplication
Problem: Multiply two large numbers represented as digit strings.
Approach: Treat each number as a polynomial where coefficient
cpp
string big_multiply(string& s1, string& s2) {
int n1 = s1.size(), n2 = s2.size();
vector<int> a(n1), b(n2);
// Reverse so index 0 = least significant digit
for (int i = 0; i < n1; i++) a[i] = s1[n1 - 1 - i] - '0';
for (int i = 0; i < n2; i++) b[i] = s2[n2 - 1 - i] - '0';
auto c = multiply(a, b);
// Propagate carries
string res;
long long carry = 0;
for (int i = 0; i < (int)c.size() || carry; i++) {
long long cur = carry + (i < (int)c.size() ? c[i] : 0);
res += (char)('0' + cur % 10);
carry = cur / 10;
}
// Remove leading zeros
while (res.size() > 1 && res.back() == '0') res.pop_back();
reverse(res.begin(), res.end());
return res;
}Example 2: Counting Subset Sums
Problem: Given
Approach: The generating function for item
cpp
// D&C polynomial multiplication of (1 + x^{v[0]}) * ... * (1 + x^{v[n-1]})
vector<long long> dc_multiply(vector<vector<long long>>& polys, int l, int r) {
if (l == r) return polys[l];
int mid = (l + r) / 2;
auto left = dc_multiply(polys, l, mid);
auto right = dc_multiply(polys, mid + 1, r);
return multiply_ntt(left, right);
}
vector<long long> count_subset_sums(vector<int>& values) {
vector<vector<long long>> polys;
for (int v : values) {
vector<long long> p(v + 1, 0);
p[0] = 1;
p[v] = 1;
polys.push_back(p);
}
return dc_multiply(polys, 0, (int)polys.size() - 1);
}Example 3: Codeforces-Style Problem
Problem (CF 954I style): Given a binary text ?), find all positions in
Approach: Define
Reverse
cpp
vector<int> wildcard_match(string& text, string& pat) {
int n = text.size(), m = pat.size();
// Encode: '1' -> 1, '0' -> -1 (or keep as 0/1), '?' -> 0
// For simplicity, use the direct squared-difference approach
// with 3 convolutions.
// t[i], p[j]: character value (0 for wildcard)
// Match at s: sum_j p[j]^3 * t[s+j] - 2*p[j]^2*t[s+j]^2 + p[j]*t[s+j]^3
// = sum_j t[s+j]*p[j]*(t[s+j]-p[j])^2
// Reverse p for convolution
// ... (3 convolutions, check if result[s] == 0)
// Placeholder for full implementation:
vector<int> match_positions;
// (see pattern 4 above for the full technique)
return match_positions;
}Self-Test
Without looking at the reference material, you should be able to:
Write the iterative FFT from memory -- including bit-reversal permutation, butterfly loop, and the
invertflag for inverse FFT.Write the
multiplywrapper -- computing the required FFT size, padding both arrays, forward FFT, pointwise multiply, inverse FFT, and rounding results.State when to use FFT vs. NTT -- and which moduli are NTT-friendly. Why is
special? Why is problematic? Recognize a frequency-array convolution problem -- "count pairs summing to
" -> build freq array -> self-convolve. State the WHT for XOR convolution -- at least at the level of "it's a different transform, same structure."
Explain the bit-reversal permutation -- why does the iterative FFT reorder inputs this way, and what happens if you skip it?
Trace through a small FFT by hand -- e.g., multiply
through the full evaluate -> pointwise multiply -> interpolate pipeline and verify the result is .
- [ ] Given arrays of sizes 5 and 7, compute the exact FFT size needed and the exact result array size. (Answer: result size = 11, FFT size = 16.)
- [ ] Compute
for , to find the primitive 8th root of unity used in an NTT of size 8. - [ ] Manually verify that
llroundgives correct results for multiplyingvia FFT on doubles. What is the largest coefficient, and is it below ? - [ ] Write a modular-arithmetic
multiplywrapper for NTT that handles: (a) computing FFT size, (b) padding both arrays, (c) forward NTT, (d) pointwise multiply mod, (e) inverse NTT with scaling, (f) trimming the result to . - [ ] Given a problem "count the number of ways to pick two elements from an array with sum
," write the full solution using frequency-array self-convolution, correctly handling the and ordered-vs-unordered pair adjustments. - [ ] Given
and , compute the required FFT size ( ) and verify it fits within 998244353's max NTT size ( ). - [ ] Compute the primitive
-th root of unity modulo : . Verify it has order exactly 8. - [ ] For arrays with values up to
and length , compute the max coefficient ( ) and decide: FFT-doubles or NTT? (Answer: NTT -- exceeds precision limit.) - [ ] Write an NTT-based
multiplyfunction that works mod 998244353, test it on-> . - [ ] Given a problem "count unordered pairs
with ", write the frequency-array convolution and handle the double-counting correction. - [ ] Multiply
by hand through the full FFT pipeline: pad to size 4, compute 4th roots of unity, evaluate both polynomials, pointwise multiply, inverse FFT, and verify . - [ ] Given a frequency array freq = [0, 3, 0, 2, 0, 1] (three 1s, two 3s, one 5), compute the self-convolution entry for sum
and explain the ordered-vs-unordered pair adjustment. - [ ] Write the bit-reversal permutation for
-- list all 16 indices and their bit-reversed counterparts in a table. - [ ] Explain why padding to
when suffices roughly doubles the FFT runtime -- relate this to the formula and compute the actual operation counts. - [ ] For an NTT mod 998244353, compute the primitive 4th root of unity:
. Verify it has order exactly 4 by checking that its square equals .
Practice Problems
| # | Problem | Key Technique | Difficulty |
|---|---|---|---|
| 1 | SPOJ POLYMUL | Direct polynomial multiplication | CF ~1600 |
| 2 | CF 632E - Thief in a Shop | Generating functions + poly power | CF 2000 |
| 3 | CF 993E - Nikita and Order Statistics | Counting via convolution | CF 1900 |
| 4 | CF 954I - Yet Another String Matching | String matching via FFT | CF 2200 |
| 5 | CF 958F3 - Lightsabers (hard) | NTT + generating functions | CF 2100 |
| 6 | CF 623E - Transforming Sequence | NTT + binary lifting on polys | CF 2600 |
| 7 | Atcoder ATC001 C - Fast Fourier Transform | Basic FFT practice | CF ~1700 |
| 8 | CF 528D - Fuzzy Search | FFT for approximate matching | CF 2000 |
| 9 | CF 1096G - Lucky Tickets | NTT on frequency arrays | CF 2000 |
| 10 | CF 300D - Painting Squares | FFT + combinatorics | CF 2100 |
| 11 | Apples and Bananas | Convolution of two frequency arrays | CSES 2000 |
| 12 | One Bit Positions | Bitstring autocorrelation via FFT | CSES 2000 |
| 13 | Signal Processing | Cross-correlation of two signals | CSES 2000 |
Suggested progression:
- Start with SPOJ POLYMUL and Atcoder ATC001 C to verify your template.
- Move to CF 993E and 1096G for standard NTT applications.
- Try CF 632E for polynomial exponentiation.
- Tackle CF 528D and 954I for convolution-based string matching.
- Attempt CF 623E for a challenging NTT + D&C combination.
Summary
FFT / NTT Cheat Sheet
=====================
WHEN: Polynomial multiplication, convolution, "count sums", n >= 10^5
TIME: O(n log n) per multiply
SPACE: O(n)
FFT (doubles):
+ Works with any modulus (or no modulus)
- Precision issues for large coefficients (> ~10^14)
- Slower constant factor
NTT (mod p):
+ Exact, fast, no precision issues
- Only works modulo NTT-friendly primes
- 998244353 is the standard contest prime
RECIPE:
1. Pad both arrays to next power of 2 >= |a| + |b| - 1
2. Forward FFT/NTT on both
3. Pointwise multiply
4. Inverse FFT/NTT
5. (FFT) Round to nearest integer / (NTT) done
BITWISE VARIANTS:
XOR convolution -> Walsh-Hadamard Transform
OR convolution -> Sum-over-subsets (zeta) transform
AND convolution -> Superset-sum transform
All O(n log n), same butterfly structure.
ARBITRARY MODULUS:
Split coefficients into 15-bit halves, use 3-4 FFTs on doubles.The Aha Moment
Multiplying two polynomials of degree
NTT is the same idea, but over
Pattern Fingerprint
| Constraint / Signal | Technique |
|---|---|
| "Multiply two polynomials" or "convolve two sequences" | FFT / NTT |
| FFT / NTT gives | |
| "Count pairs | Convolution of frequency arrays |
| "Multiply large numbers" ( | Treat digits as polynomial coefficients, FFT-multiply |
| Answer must be modulo | NTT -- this prime has |
| Answer modulo arbitrary prime (e.g., | Split coefficients + multiple NTTs, or CRT with NTT-friendly primes |
| "Partition function" / "subset sum count" / generating functions | Polynomial operations via FFT |
| Divide-and-conquer optimization with convolution | Online FFT / divide-and-conquer FFT |
Rating Progression
| CF Rating | What You Should Be Able To Do |
|---|---|
| 1200 | Know that polynomial multiplication can be done faster than |
| 1500 | Recognize problems reducible to convolution (frequency array multiplication); use library FFT |
| 1800 | Implement FFT/NTT from scratch; handle size padding to powers of 2; know when to use NTT vs FFT |
| 2100 | Polynomial division, modular composition, online convolution; combine FFT with generating functions; handle arbitrary modulus via CRT or splitting |
Before You Code Checklist
- [ ] Pad both polynomials to the same size = next power of 2
-- the result polynomial has degree , so you need that many points - [ ] Choose FFT (doubles) or NTT (modular) -- FFT is simpler but has precision issues for
with large coefficients; NTT is exact but requires a suitable prime - [ ] For NTT: verify the prime supports the needed transform size --
supports up to ; for larger, use a different prime or CRT - [ ] Bit-reverse the array before the butterfly -- or use the iterative FFT with bit-reversal permutation
- [ ] After inverse FFT: divide by
-- the inverse transform includes a factor; forgetting it gives answers scaled by
What Would You Do If...?
Scenario 1: "The problem needs convolution modulo
Scenario 2: "You need to multiply polynomials of degree double precision (double FFT will give wrong answers.
Scenario 3: "The problem reduces to multiplying many polynomials
The Mistake That Teaches
Story: You implement NTT for the first time. You use mod
You double-check your NTT implementation: butterfly structure looks right, forward and inverse transforms are consistent. You test: multiply
Then you notice: your padded size inv_n = power(n, MOD-2). You print inv_n... it looks right. But wait -- you're computing power(n, MOD-2, MOD) where n is an int, and int. That's fine. Except... your power function takes long long arguments but inside you wrote base *= base without % MOD. For the forward NTT, the root of unity w = power(g, (MOD-1)/n) -- you check: (MOD-1)/n for n = 1048576 gives (998244352) / 1048576 = 952. So w = power(3, 952, MOD). You print w... it's correct.
After two hours, you find it: in the butterfly loop, you wrote t = a[j + half] * w % MOD -- but a[j + half] * w overflows long long because both are up to long long (
Finally, you spot it: you forgot to take % MOD after a[j] + t. The sum of two values each up to long long. But it DOESN'T stay in
The fix: Always % MOD after BOTH addition AND subtraction in the butterfly:
cpp
long long t = a[j + half] % MOD * w % MOD;
a[j + half] = (a[j] - t + MOD) % MOD;
a[j] = (a[j] + t) % MOD;Lesson: In modular NTT, take % MOD after every single arithmetic operation. Values that "fit in long long" today will overflow when multiplied tomorrow.
Debug This
Bug 1:
Find the bug in this FFT implementation
cpp
void fft(vector<complex<double>>& a, bool invert) {
int n = a.size();
// bit-reversal permutation
for (int i = 1, j = 0; i < n; i++) {
int bit = n >> 1;
for (; j & bit; bit >>= 1) j ^= bit;
j ^= bit;
if (i < j) swap(a[i], a[j]);
}
// butterfly
for (int len = 2; len <= n; len <<= 1) {
double ang = 2 * M_PI / len * (invert ? -1 : 1);
complex<double> wlen(cos(ang), sin(ang));
for (int i = 0; i < n; i += len) {
complex<double> w(1);
for (int j = 0; j < len; j++) { // BUG HERE
complex<double> u = a[i+j], v = a[i+j+len/2] * w;
a[i+j] = u + v;
a[i+j+len/2] = u - v;
w *= wlen;
}
}
}
if (invert) for (auto& x : a) x /= n;
}Bug: The inner loop runs j from 0 to len - 1, but it should only go to len/2 - 1. When j >= len/2, the index i + j + len/2 goes out of the current block (accesses i + len and beyond). This reads/writes wrong elements.
Fix: Change j < len to j < len / 2.
Bug 2:
Find the bug in this NTT-based polynomial multiplication
cpp
vector<long long> multiply(vector<long long>& a, vector<long long>& b) {
int result_size = a.size() + b.size() - 1;
int n = 1;
while (n < result_size) n <<= 1;
a.resize(n); b.resize(n);
ntt(a, false); // forward NTT
ntt(b, false);
for (int i = 0; i < n; i++)
a[i] = a[i] * b[i] % MOD;
ntt(a, true); // inverse NTT
a.resize(result_size);
return a;
}Bug: The function modifies a and b in place (resizing them with zero-padding). If the caller reuses a or b after this call, they'll find their original data destroyed and padded with zeros. Also, a[i] * b[i] may overflow if both are close to MOD (
Fix: Work on copies:
cpp
vector<long long> fa(a.begin(), a.end()), fb(b.begin(), b.end());
fa.resize(n); fb.resize(n);And for the multiplication: fa[i] = fa[i] % MOD * fb[i] % MOD;
Bug 3:
Find the bug in this convolution for counting pair sums
cpp
// Count number of pairs (i,j) with a[i] + a[j] = k, for each k
// a[i] in [0, 100000]
vector<long long> freq(100001, 0);
for (int x : a) freq[x]++;
// Convolve freq with itself
auto result = multiply(freq, freq);
// result[k] = number of ordered pairs with a[i] + a[j] = k
// But we want unordered pairs, and exclude i == j
for (int k = 0; k < result.size(); k++) {
result[k] -= freq[k/2]; // subtract pairs where i == j ???
result[k] /= 2; // unordered pairs
}Bug: The "subtract diagonal" logic is wrong. A pair result[2 * a[i]], not result[a[i]]. To remove self-pairs, subtract freq[x] from result[2*x] for each freq[k/2] from result[k] for ALL
Fix:
cpp
for (int x = 0; x <= 100000; x++)
result[2 * x] -= freq[x]; // remove i == j pairs
for (int k = 0; k < result.size(); k++)
result[k] /= 2; // unorderedCommon Mistakes
- Not padding to next power of 2. FFT requires array sizes to be powers of 2. The result array size must be at least deg(A) + deg(B) + 1, rounded up.
- Precision loss with FFT over doubles. For large coefficients or many terms, use NTT (mod prime) or split into high/low halves.
- Wrong NTT prime or generator. The prime must be of the form p = c * 2^k + 1 where 2^k >= n. Standard: 998244353 = 119 * 2^23 + 1, generator 3.
- Forgetting to divide by n in inverse FFT. The IFFT step requires dividing each coefficient by n (or by n mod p for NTT).
- Self-convolution diagonal error. When counting unordered pairs (i,j) with a[i]+a[j]=k, subtract self-pairs from result[2*x] for each x, then divide by 2.
Historical Origin
The Fast Fourier Transform was published by James Cooley and John Tukey in 1965, revolutionizing digital signal processing. However, historians later discovered that Carl Friedrich Gauss had described essentially the same algorithm around 1805 for interpolating asteroid orbits -- predating Cooley-Tukey by 160 years. The Number Theoretic Transform (NTT) was developed in the 1970s as the modular-arithmetic analog, avoiding floating-point entirely.
Mnemonic Anchor
"Evaluate, multiply pointwise, interpolate --
beats by splitting at roots of unity."
Recap and Cross-Links
| Concept | Link |
|---|---|
| Math Fundamentals | 01-math-fundamentals |
| Number Theory | 02-number-theory |
| Divide and Conquer | 14-divide-and-conquer |
| Practice | Ladder - Mixed |