FFT & NTT (Polynomial Multiplication)

Quick Revisit
USE WHEN: polynomial multiplication and sum convolution (counting pairs with sum = k). For XOR convolution use FWHT (Walsh-Hadamard transform), not FFT.
INVARIANT: evaluate at n points, multiply pointwise, interpolate back -- convolution theorem
TIME: O(n log n) | SPACE: O(n)
CLASSIC BUG: array size not padded to next power of 2 -- causes out-of-bounds or wrong results
PRACTICE: ../12-practice-ladders/08-ladder-mixed.md

Difficulty: [Advanced] CF 1900-2200+ Prerequisites: Math Fundamentals, Number Theory, Divide and ConquerMotivation: When you need to multiply polynomials or compute convolutions in $O (n \log n)$ instead of $O (n^{2})$ .

Contest Frequency: [*] Specialized -- appears in polynomial/convolution problems

Intuition
How It Works (Detailed Mechanics)
Implementation
Patterns & Applications
Complexity Analysis
Edge Cases & Gotchas
Debugging Checklist
Worked Examples
Self-Test
Practice Problems
Summary

FFT vs NTT — pick before you implement

The two transforms share the same convolution-theorem skeleton but differ in arithmetic:

	FFT	NTT
Field	Complex numbers (`double` / `long double`)	$Z / p Z$ for a special prime $p$
"Roots of unity"	$e^{2 π i k / n}$ on the unit circle	A primitive $n$ -th root of unity mod $p$ (e.g. 3 mod 998244353)
Exact?	No — accumulates floating-point rounding error	Yes — every operation is an integer mod $p$
Risk	Loss of precision when coefficients × n is large	Modulus must support $n$ as a divisor of $p - 1$ ; works only for "NTT-friendly" primes
Use when	Real-valued / complex coefficients, or output fits comfortably in `double` precision	Answer is required modulo a specific prime (typically $998244353 = 119 \cdot 2^{23} + 1$ )

Decision rule: if the problem asks for the answer modulo a fixed NTT-friendly prime (especially $998244353$ or $10^{9} + 7$ via three-prime NTT), reach for NTT. If the coefficients are small reals or the modulus is arbitrary and you can afford the precision loss, FFT is fine. Mixing them inside one routine is rarely worth it for contest scope.

Intuition

The Pain

Multiply $A (x) = 1 + 2 x + 3 x^{2}$ and $B (x) = 4 + 5 x$ .

The naive approach: every coefficient of $A$ multiplies every coefficient of $B$ .

          1   2   3       (coefficients of A)
      x   4   5           (coefficients of B)
      -----------
          5  10  15       (multiply by 5, shift by 1)
      +   4   8  12       (multiply by 4, shift by 0)
      -----------
          4  13  22  15   => C(x) = 4 + 13x + 22x^2 + 15x^3

This is exactly like long multiplication. For degree- $n$ polynomials, that is $O (n^{2})$ multiplications. When $n = 10^{5}$ or $10^{6}$ , this is far too slow.

The wall: Any approach that touches every pair of coefficients is $O (n^{2})$ .

The Key Insight

"Evaluate both polynomials at $n$ special points, multiply pointwise in $O (n)$ , then interpolate back -- the magic is that roots of unity make evaluation and interpolation $O (n \log n)$ via divide and conquer."

Why does this work? A polynomial of degree $n - 1$ is uniquely determined by its values at $n$ distinct points. So the workflow is:

Coefficient repr.          Point-value repr.          Coefficient repr.
   A = [a0..an-1]  ---->  A(w0), A(w1), ..., A(wn-1)
   B = [b0..bn-1]  ---->  B(w0), B(w1), ..., B(wn-1)
                                    |
                           multiply pointwise O(n)
                                    |
                           C(wi) = A(wi) * B(wi)
                                    |
   C = [c0..c2n-2] <----  interpolate back (inverse FFT)

The "special points" are the $n$ -th roots of unity: $w_{k} = e^{2 π i k / n}$ for $k = 0, 1, \dots, n - 1$ . Their algebraic structure -- specifically the halving property $w_{n}^{2 k} = w_{n / 2}^{k}$ -- enables the divide and conquer split.

Visual Walkthrough

Step 1: Evaluate (FFT forward)

Split polynomial into even and odd indexed coefficients: $A (x) = A_{even} (x^{2}) + x \cdot A_{odd} (x^{2})$

For $A (x) = a_{0} + a_{1} x + a_{2} x^{2} + a_{3} x^{3}$ :

$A_{even} (y) = a_{0} + a_{2} y$
$A_{odd} (y) = a_{1} + a_{3} y$

Then $A (w_{k}) = A_{even} (w_{k}^{2}) + w_{k} \cdot A_{odd} (w_{k}^{2})$ .

Since $w_{n}^{2}$ cycles through the $(n / 2)$ -th roots of unity, we recurse on half the size!

Butterfly diagram (n=8, showing one level):

Input:  a0  a1  a2  a3  a4  a5  a6  a7
        |       |       |       |
        +--+-+--+  split even/odd
        |  | |  |
Even:  a0  a2  a4  a6      Odd:  a1  a3  a5  a7
        \       /                  \       /
     FFT(n/2)                   FFT(n/2)
        |                          |
    E0 E1 E2 E3              O0 O1 O2 O3
     \  |  |  /                \  |  |  /
      combine with twiddle factors w^k:
      y[k]     = E[k] + w^k * O[k]       (k = 0..n/2-1)
      y[k+n/2] = E[k] - w^k * O[k]       (k = 0..n/2-1)

Full butterfly diagram (n=8) -- all levels of the iterative FFT:

  FFT BUTTERFLY NETWORK (n=8, iterative, bottom-up)
  Inputs in bit-reversed order on the left; output = DFT values on the right.

  Level 0 (input)     Level 1 (2-pt DFTs)   Level 2 (4-pt DFTs)   Level 3 (8-pt DFT)
  bit-reversed idx    combine pairs          combine quads          combine all 8
  ==================  ====================  ====================  ==================
  a0 (000) ----*------+----- X0' ---*------+------ X0'' --*------+------ Y0
               |  w^0 |             |  w^0 |               |  w^0 |
  a4 (100) ----*------+----- X1' ---+------+------ X1'' ---+------+------ Y1
                                    |  w^1 |               |  w^1 |
  a2 (010) ----*------+----- X2' ---*------+------ X2'' ---+------+------ Y2
               |  w^0 |                                    |  w^2 |
  a6 (110) ----*------+----- X3' ---+------+------ X3'' ---+------+------ Y3
                                    |      |               |  w^3 |
  a1 (001) ----*------+----- X4' ---*------+------ X4'' --*------+------ Y4
               |  w^0 |             |  w^0 |               |      |
  a5 (101) ----*------+----- X5' ---+------+------ X5'' ---+------+------ Y5
                                    |  w^1 |               |      |
  a3 (011) ----*------+----- X6' ---*------+------ X6'' ---+------+------ Y6
               |  w^0 |                                    |      |
  a7 (111) ----*------+----- X7' ---+------+------ X7'' ---+------+------ Y7

  Each ---*---+--- is a butterfly:  top = u + w^k * v
                                    bot = u - w^k * v

  Level 1: stride=1, twiddle w^0 (all trivial +/- pairs)
  Level 2: stride=2, twiddles w^0, w^1  (w = w_4 = e^{2*pi*i/4} = i)
  Level 3: stride=4, twiddles w^0, w^1, w^2, w^3  (w = w_8 = e^{2*pi*i/8})

8th roots of unity on the complex plane:

                    Im
                    ^
            w^2     |     w^1
         (-0.7+0.7i)|  (0.7+0.7i)
              *     |     *
                \   |   /
    w^3          \  |  /         w^0
    (-1+0i) *-----\-+-/-------* (1+0i) --> Re
                  / | \
                /   |   \
              *     |     *
         (-0.7-0.7i)|  (0.7-0.7i)
            w^4     |     w^7
        (= -1)      |
              *     |     *
            w^5     |     w^6
         (-0.7-0.7i)|  (0.7-0.7i)

  w_8 = e^{2*pi*i/8},  w^k = (cos(2*pi*k/8), sin(2*pi*k/8))

  Key identity:  (w^k)^2 = w^{2k}  maps 8th roots to 4th roots
                 w^0=1   w^1=(1+i)/sqrt2   w^2=i   w^3=(-1+i)/sqrt2
                 w^4=-1  w^5=(-1-i)/sqrt2  w^6=-i  w^7=(1-i)/sqrt2

  The HALVING PROPERTY: squaring w_8^k gives w_4^k
    w_8^0 -> w_4^0,  w_8^1 -> w_4^1,  w_8^2 -> w_4^2,  w_8^3 -> w_4^3
    w_8^4 -> w_4^0,  w_8^5 -> w_4^1,  w_8^6 -> w_4^2,  w_8^7 -> w_4^3
  This is WHY the even/odd split works: both halves evaluate at SAME points!

The Convolution Theorem -- why FFT makes polynomial multiplication fast:

  TIME DOMAIN                              FREQUENCY DOMAIN
  (coefficient representation)             (point-value representation)
  +---------------------------+            +---------------------------+
  |  A = [a0, a1, ..., an-1]  |  --FFT-->  |  A^ = [A(w^0), ..., A(w^{n-1})] |
  |  B = [b0, b1, ..., bn-1]  |  --FFT-->  |  B^ = [B(w^0), ..., B(w^{n-1})] |
  +---------------------------+            +---------------------------+
            |                                         |
            | Naive multiply:                         | Pointwise multiply:
            | O(n^2)                                  | O(n)
            |                                         |
            v                                         v
  +---------------------------+            +---------------------------+
  |  C = A * B                |  <--IFFT-- |  C^ = A^ . B^            |
  |  (convolution, O(n^2))    |            |  C^[k] = A^[k] * B^[k]   |
  +---------------------------+            +---------------------------+

  Total cost:  FFT(A) + FFT(B) + pointwise + IFFT(C)
             = O(n log n) + O(n log n) + O(n) + O(n log n)
             = O(n log n)

  THE KEY: convolution in time domain = pointwise product in freq domain.
           FFT converts between the two in O(n log n).

Step 2: Pointwise multiply -- trivially $O (n)$ .

Step 3: Interpolate (inverse FFT) -- same algorithm, but use $w^{- 1}$ instead of $w$ , and divide all results by $n$ .

The Invariant

"At each recursion level, the polynomial is split into even and odd coefficients, evaluated at roots of unity of half the order."

More precisely: at recursion depth $d$ , we evaluate $n / 2^{d}$ sub-polynomials, each of degree $2^{d} - 1$ , at the $2^{d}$ -th roots of unity. The butterfly recombination at each level is $O (n)$ work, and there are $\log_{2} n$ levels, giving $O (n \log n)$ total.

Why $n$ must be a power of 2: The halving property requires $n$ to divide evenly at every level. We pad the polynomial to the next power of 2.

When to Reach for This

Trigger phrases in problem statements:

Signal	Translation
"polynomial multiplication"	Direct FFT/NTT
"convolution" / "convolve"	FFT/NTT
"count ways with sum constraints"	Generating function + FFT
" $n \leq 10^{6}$ " with combinatorial sums	Probably needs FFT
"number of pairs with $a_{i} + a_{j} = k$ "	Frequency array convolution
"string matching with wildcards"	Convolution trick
"XOR/OR/AND convolution"	Bitwise convolution variant
"answer modulo $998244353$ "	NTT-friendly prime, strong hint

Quick decision:

Floating point OK, moderate $n$ ? Use FFT (doubles).
Need exact answers mod a prime? Use NTT.
$998244353 = 119 \cdot 2^{23} + 1$ has primitive root $3$ and supports NTT up to $n = 2^{23} \approx 8 \times 10^{6}$ .

Insight Gaps

Recognizing when a problem needs convolution. The hardest part of FFT problems is not implementing FFT -- it's seeing that the problem is a convolution in disguise. Triggers:

"Count pairs $(i, j)$ with $a_{i} + a_{j} = k$ for each $k$ " --> self-convolution of frequency array
"Multiply two polynomials" --> direct application
"Count ways to roll $n$ dice summing to $k$ " --> multiply generating functions
"String matching with wildcards" --> cross-correlation via FFT
"XOR convolution" --> Walsh-Hadamard transform (different algorithm, same structure)

WHT for XOR/OR/AND convolutions. When the "index combination" operation is XOR/OR/AND instead of addition, standard FFT doesn't apply. The Walsh-Hadamard Transform (WHT) handles XOR convolution; sum-over-subsets (SOS/zeta transform) handles OR/AND. These are different transforms from FFT but follow the same "forward transform, pointwise operation, inverse transform" structure.

CONVOLUTION TYPE --> ALGORITHM DECISION:

  C[k] = sum_{i + j = k}  A[i] * B[j]    --> FFT / NTT
  C[k] = sum_{i XOR j = k} A[i] * B[j]   --> WHT (Walsh-Hadamard)
  C[k] = sum_{i OR  j = k} A[i] * B[j]   --> Zeta (sum-over-subsets)
  C[k] = sum_{i AND j = k} A[i] * B[j]   --> Zeta (superset-sum)

  All follow the same recipe:
    forward_transform(A), forward_transform(B)
    pointwise multiply
    inverse_transform(result)

  Only the transform function changes.

The bit-reversal permutation. The iterative FFT reorders input by bit-reversal of indices before the butterfly passes. Understanding why (the recursive splitting by even/odd indices produces a specific permutation, and bit-reversal is its inverse) is the difference between blindly using the implementation and being able to debug it.

What the Code Won't Teach You

The FFT code converts between coefficient representation and point-value representation. What it hides is why this is useful: pointwise multiplication is $O (n)$ , so the bottleneck is just the two transforms.

THE FFT RECIPE -- what the code actually does:

  Coefficient form            Point-value form
  ┌─────────────┐    FFT     ┌─────────────┐
  │ A = [a₀..aₙ]│──────────->│ A(ω⁰)..A(ωⁿ)│
  └─────────────┘            └──────┬──────┘
                                    │ pointwise
  ┌─────────────┐    FFT     ┌──────┴──────┐   multiply
  │ B = [b₀..bₙ]│──────────->│ B(ω⁰)..B(ωⁿ)│──-> C(ωᵏ)=A(ωᵏ)*B(ωᵏ)
  └─────────────┘            └─────────────┘
                                                    │
  ┌─────────────┐  inv FFT   ┌─────────────┐       │
  │ C = [c₀..cₙ]│<-──────────│ C(ω⁰)..C(ωⁿ)│<-──────┘
  └─────────────┘            └─────────────┘

  Cost: O(n log n) + O(n) + O(n log n) = O(n log n)
  vs. naive coefficient multiply: O(n²)

The code also won't teach you the evaluation-interpolation duality. FFT evaluates a polynomial at $n$ roots of unity ( $O (n \log n)$ ). Inverse FFT interpolates back from those values. This duality is why the same algorithm (with conjugated roots and a $1 / n$ factor) works for both directions.

BIT-REVERSAL PERMUTATION for n=8:

  Index (decimal): 0  1  2  3  4  5  6  7
  Index (binary):  000 001 010 011 100 101 110 111
  Reversed:        000 100 010 110 001 101 011 111
  New position:    0  4  2  6  1  5  3  7

  Input array: [a₀ a₁ a₂ a₃ a₄ a₅ a₆ a₇]
  After perm:  [a₀ a₄ a₂ a₆ a₁ a₅ a₃ a₇]

  This reordering makes the butterfly stages work in-place:

  Stage 1: pairs (0,1)(2,3)(4,5)(6,7) -- stride 1
  Stage 2: pairs (0,2)(1,3)(4,6)(5,7) -- stride 2
  Stage 3: pairs (0,4)(1,5)(2,6)(3,7) -- stride 4

Finally, when precision fails (coefficients exceed ~ $10^{15}$ ), the code gives wrong answers silently -- no runtime error, just slightly off integers after llround. The NTT avoids this entirely by working in exact modular arithmetic, but only for NTT-friendly primes.

The code also won't teach you the deep connection between FFT and NTT: both are instances of the Discrete Fourier Transform over different algebraic structures. FFT works over $C$ using complex roots of unity; NTT works over $Z / p Z$ using primitive roots of unity modulo $p$ . The butterfly structure is identical -- only the "twiddle factors" change.

WHY 998244353 IS THE CANONICAL NTT PRIME:

  998244353 = 119 x 2²³ + 1

  Key properties:
  ┌─────────────────────────────────────────────────┐
  │ 1. It's prime              -> modular inverse    │
  │ 2. p-1 = 119 x 2²³        -> 2²³ divides p-1    │
  │ 3. Max NTT size = 2²³     -> ~8 million elements │
  │ 4. Primitive root g = 3   -> verified, standard   │
  └─────────────────────────────────────────────────┘

  Compare with 10⁹+7 = 1000000007:
  ┌─────────────────────────────────────────────────┐
  │ 10⁹+7 - 1 = 2 x 500000003                      │
  │ Largest power of 2 dividing (p-1) = 2¹          │
  │ Max NTT size = 2  <- USELESS for NTT!            │
  └─────────────────────────────────────────────────┘

  When you need mod 10⁹+7: use split-coefficient FFT
  (3-4 real FFTs on 15-bit halves) or multi-prime NTT
  with CRT reconstruction.

BUTTERFLY DIAGRAM for n=8 FFT:

  a₀ ──┬──────────┬──────────┬──── X₀
  a₄ ──┤  ω⁰     ┤  ω⁰     ┤
       ├──────────┤          │
  a₂ ──┤          ├──────────┤──── X₁
  a₆ ──┤  ω⁰     │  ω¹     │
       ├──────────┤          │
  a₁ ──┤          ├──────────┤──── X₂
  a₅ ──┤  ω⁰     │  ω²     │
       ├──────────┤          │
  a₃ ──┤          ├──────────┤──── X₃
  a₇ ──┘  ω⁰     ┘  ω³     ┘

  Each stage: combine pairs with twiddle factor ωᵏ
  3 stages for n=8 (log₂8 = 3)

The connection between FFT and NTT: both are DFT, just over different rings. FFT computes the Discrete Fourier Transform over $C$ using complex roots of unity $e^{2 π i / n}$ . NTT computes the same DFT over $Z / p Z$ using a primitive root of unity modulo $p$ . The butterfly structure, the divide-and-conquer recurrence, and the bit-reversal permutation are identical -- only the "numbers" and their arithmetic change. This is why your FFT template converts to NTT by swapping complex<double> with long long and replacing $ω = e^{2 π i / n}$ with $g^{(p - 1) / n} mod p$ .

The evaluation-interpolation duality is the soul of FFT. The FFT code performs two operations: evaluate a polynomial at $n$ roots of unity (forward FFT) and recover coefficients from values at those points (inverse FFT). These are exact inverses of each other because the DFT matrix $F$ and its inverse $F^{- 1} = \frac{1}{n} \bar{F}$ are related by complex conjugation and scaling. Understanding this duality means you can derive the inverse FFT from the forward FFT without memorizing it separately -- just conjugate the roots and divide by $n$ . It also explains why the same butterfly structure works in both directions.

Why $998244353$ is the competitive programmer's favorite NTT prime. $998244353 = 119 \times 2^{23} + 1$ . The factor $2^{23}$ means this prime supports NTT of size up to $2^{23} = 8,388,608$ -- large enough for almost any contest problem. A primitive root is $g = 3$ . Compare with $10^{9} + 7 = 500000003 \times 2 + 1$ : the factor of $2$ is only $2^{1}$ , so it only supports NTT of size $2$ -- useless in practice.

FFT vs NTT -- SAME ALGORITHM, DIFFERENT RINGS:

  ┌──────────────────────┬───────────────────────────────┐
  │        FFT            │            NTT                │
  ├──────────────────────┼───────────────────────────────┤
  │ Ring: ℂ (complex)     │ Ring: ℤ/pℤ (mod prime p)     │
  │ Root: ω = e^(2πi/n)  │ Root: ω = g^((p-1)/n) mod p  │
  │ Inverse: conjugate    │ Inverse: modular inverse      │
  │ Scaling: ÷ n          │ Scaling: x n⁻¹ mod p         │
  │ Precision: <= 10¹⁵     │ Precision: exact (mod p)     │
  ├──────────────────────┴───────────────────────────────┤
  │  Butterfly structure, bit-reversal, O(n log n)       │
  │  -- ALL identical between FFT and NTT                 │
  └──────────────────────────────────────────────────────┘

  WHY 998244353?

  998244353  =  119 x 2²³ + 1
                      ^^^
                      max NTT size = 2²³ = 8,388,608
                      primitive root g = 3

  Compare:
  10⁹ + 7   =  500000003 x 2¹ + 1   ->  max NTT size = 2  (useless!)
  985661441  =  235 x 2²²  + 1       ->  max NTT size = 2²² = 4,194,304
  469762049  =  7   x 2²⁶  + 1       ->  max NTT size = 2²⁶ (even bigger!)

How It Works (Detailed Mechanics)

Roots of Unity Primer

The $n$ -th roots of unity are the $n$ complex numbers satisfying $z^{n} = 1$ : $w_{n}^{k} = e^{2 π i k / n} = \cos (2 π k / n) + i \sin (2 π k / n)$

Key properties:

Periodicity: $w_{n}^{k + n} = w_{n}^{k}$
Halving: $w_{n}^{2 k} = w_{n / 2}^{k}$ (the crucial one for FFT)
Cancellation: $w_{n}^{n / 2} = - 1$ , so $w_{n}^{k + n / 2} = - w_{n}^{k}$
Summation: $\sum_{k = 0}^{n - 1} w_{n}^{j k} = 0$ for $j ≢ 0 (\mod n)$

Property 3 gives us the "butterfly": we compute $E [k] + w^{k} O [k]$ and $E [k] - w^{k} O [k]$ simultaneously.

Cooley-Tukey (Recursive FFT)

FFT(a[0..n-1], w):
  if n == 1: return a
  even = FFT(a[0], a[2], ..., a[n-2], w^2)
  odd  = FFT(a[1], a[3], ..., a[n-1], w^2)
  t = 1
  for k = 0 to n/2 - 1:
    a[k]       = even[k] + t * odd[k]
    a[k + n/2] = even[k] - t * odd[k]
    t = t * w
  return a

Iterative FFT (Bit-Reversal)

The recursive version has overhead from memory allocation. The iterative version works in-place by first permuting the array according to bit-reversal of indices, then building up the butterfly from small to large.

Bit-reversal example ( $n = 8$ , 3 bits):

Index:     0    1    2    3    4    5    6    7
Binary:  000  001  010  011  100  101  110  111
Reverse: 000  100  010  110  001  101  011  111
Decimal:   0    4    2    6    1    5    3    7

After bit-reversal, the iterative butterflies naturally combine adjacent pairs, then groups of 4, then 8, etc.

FFT vs NTT -- SAME BUTTERFLY, DIFFERENT ARITHMETIC:

  FFT butterfly (complex):            NTT butterfly (modular):
  
  u = a[j]                            u = a[j]
  v = a[j + half] * w                 v = a[j + half] * w % MOD
  
  a[j]        = u + v                 a[j]        = (u + v) % MOD
  a[j + half] = u - v                 a[j + half] = (u - v + MOD) % MOD
  
  w *= wn                             w = w * wn % MOD
  
  ┌─────────────────┬──────────────────────────┐
  │     FFT          │          NTT             │
  ├─────────────────┼──────────────────────────┤
  │ w = e^(2πi/n)   │ w = g^((p-1)/n) mod p   │
  │ complex mul     │ modular mul              │
  │ ÷ n at end      │ x n⁻¹ mod p at end      │
  │ precision ~10¹⁵ │ exact (mod p)            │
  │ uses sin/cos    │ uses modpow              │
  └─────────────────┴──────────────────────────┘

NTT: Modular Arithmetic Version

Replace complex roots of unity with primitive roots modulo a prime.

For prime $p$ with $p - 1 = c \cdot 2^{k}$ , we can do NTT of size up to $2^{k}$ . The primitive $n$ -th root of unity mod $p$ is $g^{(p - 1) / n}$ where $g$ is a primitive root of $p$ .

Common NTT-friendly primes:

Prime            p-1 factorization       Max NTT size    Prim. root
998244353        119 * 2^23              2^23 = 8388608       3
985661441        235 * 2^22              2^22 = 4194304       3
469762049        7   * 2^26             2^26 = 67108864       3
167772161        5   * 2^25             2^25 = 33554432       3

Inverse NTT: Same as NTT but with $g^{- 1}$ instead of $g$ , and multiply all results by $n^{- 1} (\mod p)$ .

Implementation

Recursive FFT (Educational)

cpp

#include <bits/stdc++.h>
using namespace std;

using cd = complex<double>;
const double PI = acos(-1.0);

void fft_recursive(vector<cd>& a, bool invert) {
    int n = a.size();
    if (n == 1) return;

    vector<cd> even(n / 2), odd(n / 2);
    for (int i = 0; i < n / 2; i++) {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    fft_recursive(even, invert);
    fft_recursive(odd, invert);

    double ang = 2 * PI / n * (invert ? -1 : 1);
    cd w(1), wn(cos(ang), sin(ang));
    for (int i = 0; i < n / 2; i++) {
        a[i]         = even[i] + w * odd[i];
        a[i + n / 2] = even[i] - w * odd[i];
        w *= wn;
    }
}

Iterative FFT (Contest-Ready, Doubles)

cpp

#include <bits/stdc++.h>
using namespace std;

using cd = complex<double>;
const double PI = acos(-1.0);

void fft(vector<cd>& a, bool invert) {
    int n = a.size();
    // Bit-reversal permutation
    for (int i = 1, j = 0; i < n; i++) {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) swap(a[i], a[j]);
    }
    // Butterfly computation
    for (int len = 2; len <= n; len <<= 1) {
        double ang = 2 * PI / len * (invert ? -1 : 1);
        cd wlen(cos(ang), sin(ang));
        for (int i = 0; i < n; i += len) {
            cd w(1);
            for (int j = 0; j < len / 2; j++) {
                cd u = a[i + j];
                cd v = a[i + j + len / 2] * w;
                a[i + j]           = u + v;
                a[i + j + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
    if (invert) {
        for (auto& x : a) x /= n;
    }
}

// Multiply two polynomials given as coefficient vectors.
// Returns coefficient vector of the product.
vector<long long> multiply(vector<int>& a, vector<int>& b) {
    vector<cd> fa(a.begin(), a.end()), fb(b.begin(), b.end());
    int n = 1;
    while (n < (int)(a.size() + b.size() - 1)) n <<= 1;
    fa.resize(n);
    fb.resize(n);

    fft(fa, false);
    fft(fb, false);
    for (int i = 0; i < n; i++) fa[i] *= fb[i];
    fft(fa, true);

    vector<long long> res(n);
    for (int i = 0; i < n; i++) res[i] = llround(fa[i].real());
    return res;
}

Precision note: With double, reliable for coefficients up to about $10^{14}$ . For larger values or exact modular results, use NTT.

NTT (Contest-Ready, mod 998244353)

cpp

#include <bits/stdc++.h>
using namespace std;

const int MOD = 998244353;
const int PRIM_ROOT = 3;

long long power(long long a, long long b, long long mod) {
    long long res = 1;
    a %= mod;
    while (b > 0) {
        if (b & 1) res = res * a % mod;
        a = a * a % mod;
        b >>= 1;
    }
    return res;
}

long long inv(long long a, long long mod) {
    return power(a, mod - 2, mod);
}

void ntt(vector<long long>& a, bool invert) {
    int n = a.size();
    // Bit-reversal permutation
    for (int i = 1, j = 0; i < n; i++) {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) swap(a[i], a[j]);
    }
    // Butterfly computation
    for (int len = 2; len <= n; len <<= 1) {
        long long w = invert ? inv(PRIM_ROOT, MOD) : PRIM_ROOT;
        w = power(w, (MOD - 1) / len, MOD);
        for (int i = 0; i < n; i += len) {
            long long wn = 1;
            for (int j = 0; j < len / 2; j++) {
                long long u = a[i + j];
                long long v = a[i + j + len / 2] * wn % MOD;
                a[i + j]           = (u + v) % MOD;
                a[i + j + len / 2] = (u - v + MOD) % MOD;
                wn = wn * w % MOD;
            }
        }
    }
    if (invert) {
        long long n_inv = inv(n, MOD);
        for (auto& x : a) x = x * n_inv % MOD;
    }
}

// Multiply two polynomials modulo MOD.
vector<long long> multiply_ntt(vector<long long>& a, vector<long long>& b) {
    vector<long long> fa(a), fb(b);
    int n = 1;
    while (n < (int)(a.size() + b.size() - 1)) n <<= 1;
    fa.resize(n);
    fb.resize(n);

    ntt(fa, false);
    ntt(fb, false);
    for (int i = 0; i < n; i++) fa[i] = fa[i] * fb[i] % MOD;
    ntt(fa, true);

    fa.resize(a.size() + b.size() - 1);
    return fa;
}

Convolution Wrapper (Frequency Array)

A very common pattern: given arrays of counts, compute the "sum convolution."

cpp

// Given freq[v] = number of elements with value v,
// compute conv[s] = number of pairs (i, j) with a[i] + a[j] = s.
vector<long long> sum_convolution(vector<int>& freq) {
    // Self-convolution via FFT
    vector<cd> fa(freq.begin(), freq.end());
    int n = 1;
    while (n < (int)(2 * freq.size() - 1)) n <<= 1;
    fa.resize(n);
    fft(fa, false);
    for (int i = 0; i < n; i++) fa[i] *= fa[i];
    fft(fa, true);
    vector<long long> res(n);
    for (int i = 0; i < n; i++) res[i] = llround(fa[i].real());
    return res;
}

Arbitrary Modulus Convolution (Split Trick)

When the modulus is not NTT-friendly, split coefficients into high and low 15-bit halves and use 3 FFTs:

cpp

// Multiply polynomials mod any prime using FFT with doubles.
// Split each coefficient a[i] = a_hi[i] * BASE + a_lo[i].
// Then a*b = (a_hi*b_hi)*BASE^2 + (a_hi*b_lo + a_lo*b_hi)*BASE + a_lo*b_lo
// Each sub-product is computed via FFT on doubles (fits in precision).
vector<long long> multiply_mod(vector<int>& a, vector<int>& b, long long mod) {
    const int BASE = 1 << 15; // 32768
    int need = a.size() + b.size() - 1;
    int n = 1;
    while (n < need) n <<= 1;

    vector<cd> a_lo(n), a_hi(n), b_lo(n), b_hi(n);
    for (int i = 0; i < (int)a.size(); i++) {
        a_lo[i] = a[i] % BASE;
        a_hi[i] = a[i] / BASE;
    }
    for (int i = 0; i < (int)b.size(); i++) {
        b_lo[i] = b[i] % BASE;
        b_hi[i] = b[i] / BASE;
    }

    fft(a_lo, false); fft(a_hi, false);
    fft(b_lo, false); fft(b_hi, false);

    // Combine: 4 pointwise products
    vector<cd> hh(n), hl(n), lh(n), ll(n);
    for (int i = 0; i < n; i++) {
        hh[i] = a_hi[i] * b_hi[i];
        hl[i] = a_hi[i] * b_lo[i];
        lh[i] = a_lo[i] * b_hi[i];
        ll[i] = a_lo[i] * b_lo[i];
    }

    fft(hh, true); fft(hl, true); fft(lh, true); fft(ll, true);

    vector<long long> res(need);
    for (int i = 0; i < need; i++) {
        long long v_hh = llround(hh[i].real()) % mod;
        long long v_hl = llround(hl[i].real()) % mod;
        long long v_lh = llround(lh[i].real()) % mod;
        long long v_ll = llround(ll[i].real()) % mod;
        res[i] = ((v_hh % mod * BASE % mod * BASE % mod)
                + (v_hl + v_lh) % mod * BASE % mod
                + v_ll) % mod;
        res[i] = (res[i] % mod + mod) % mod;
    }
    return res;
}

Patterns & Applications

Pattern 1: Direct Polynomial Multiplication

When: The problem literally asks you to multiply two polynomials or compute a convolution of two sequences.

Given two polynomials A and B, compute C = A * B.

Solution: Direct FFT/NTT multiply.
Time: O(n log n)

Pattern 2: Counting with Generating Functions

When: "Count the number of ways to pick items whose values sum to $S$ ."

Idea: Create a generating function where the coefficient of $x^{k}$ represents the number of ways to achieve sum $k$ . Multiplying generating functions corresponds to combining independent choices.

Example: You have $n$ dice, each showing values $1$ to $6$ . Count the number of ways to get each possible total.

cpp

// Each die contributes: x^1 + x^2 + x^3 + x^4 + x^5 + x^6
// For n dice, multiply this polynomial with itself n times.
// Use repeated squaring: O(n * S * log n) via FFT, where S = 6n.
vector<long long> dice_ways(int n) {
    // Single die polynomial: coefficients for x^1..x^6
    vector<long long> single = {0, 1, 1, 1, 1, 1, 1};
    vector<long long> result = {1}; // identity: x^0

    // Binary exponentiation on polynomials
    auto poly = single;
    int exp = n;
    while (exp > 0) {
        if (exp & 1) result = multiply_ntt(result, poly);
        poly = multiply_ntt(poly, poly);
        exp >>= 1;
    }
    return result; // result[s] = number of ways to get sum s
}

CONVOLUTION TYPE -> ALGORITHM DECISION TREE:

  What operation combines indices?
  │
  ├─ i + j = k  (additive)
  │  ├─ Need exact mod p?
  │  │  ├─ p is NTT-friendly (p = c*2ᵏ+1, k large) -> NTT mod p
  │  │  ├─ p = 10⁹+7 (not NTT-friendly) -> split-coefficient FFT (7 FFTs)
  │  │  └─ arbitrary p -> CRT with 2-3 NTT-friendly primes
  │  └─ No modulus / small values -> FFT on doubles + llround
  │
  ├─ i OR j = k  -> SOS (zeta/Möbius) transform
  ├─ i AND j = k -> superset-sum transform
  ├─ i XOR j = k -> Walsh-Hadamard Transform (WHT)
  │
  └─ i * j = k (multiplicative) -> Dirichlet convolution
     (no fast transform in general; use factorization tricks)

CHOOSING YOUR FFT/NTT VARIANT -- QUICK REFERENCE:

  ┌─────────────────────────────────────────────────────────────┐
  │                   START HERE                                │
  │                      │                                      │
  │          Need modular arithmetic?                           │
  │           ╱              ╲                                   │
  │         YES               NO                                │
  │          │                 │                                 │
  │    What modulus?      Values small                           │
  │     ╱    │     ╲      enough for                             │
  │   NTT-  10⁹+7  Other doubles?                               │
  │  friendly        mod    ╱    ╲                               │
  │    │       │      │   YES     NO                             │
  │    ↓       ↓      ↓    ↓      ↓                             │
  │   NTT   Split   CRT  FFT   Multi-                           │
  │   mod p  coeff  with  with  prime                            │
  │         FFT    2-3    llround NTT                            │
  │         (7     NTT          + CRT                            │
  │         FFTs)  primes                                        │
  └─────────────────────────────────────────────────────────────┘

  Decision factors:
  ┌──────────────────┬────────────┬──────────────────────┐
  │ Variant          │ Speed      │ When to use          │
  ├──────────────────┼────────────┼──────────────────────┤
  │ NTT mod p        │ Fastest    │ p is NTT-friendly    │
  │ FFT doubles      │ Fast       │ n*v² < 10¹⁵         │
  │ Split-coeff FFT  │ ~7x FFT   │ mod 10⁹+7           │
  │ Multi-prime NTT  │ ~3x NTT   │ arbitrary modulus    │
  └──────────────────┴────────────┴──────────────────────┘

Pattern 3: Frequency Convolution (Pair Counting)

When: "How many pairs $(i, j)$ satisfy $a_{i} + a_{j} = k$ for each $k$ ?"

Build a frequency array freq[v] and self-convolve. The result at index $k$ counts ordered pairs with sum $k$ .

cpp

// freq[v] = count of elements equal to v
// After convolution: res[k] = sum over all v of freq[v] * freq[k-v]
//                            = number of ordered pairs summing to k

Variations:

Difference instead of sum: reverse one array, then convolve.
Product: take logarithms to convert to sums (with care for precision).

FREQUENCY SELF-CONVOLUTION -- counting pair sums:

  Array: [1, 3, 2, 3, 5]
  
  Frequency:  freq[1]=1, freq[2]=1, freq[3]=2, freq[5]=1
  
  Convolution C = freq ⊛ freq:
  
  C[k] = Σ freq[i] * freq[k-i]
  
  C[2] = freq[1]*freq[1] = 1     (pair: 1+1)
  C[3] = freq[1]*freq[2] + freq[2]*freq[1] = 2  (pairs: 1+2, 2+1)
  C[4] = freq[1]*freq[3] + freq[2]*freq[2] + freq[3]*freq[1] = 5
  C[6] = freq[1]*freq[5] + freq[3]*freq[3] + freq[5]*freq[1] = 6
  
  Value:  0   1   2   3   4   5   6   7   8  ...
  freq:   0   1   1   2   0   1   0   0   0
  conv:   0   0   1   2   5   4   6   4   1  ...
              ^               ^
              |               |
         1+1=2 (1 way)  1+3=4, 2+2=4, 3+1=4 (5 ordered ways)

Pattern 4: String Matching via Convolution

When: "Find all positions where pattern $P$ matches text $T$ , allowing wildcards."

Encode characters as numbers. A match at position $s$ means $\sum_{j = 0}^{m - 1} (T [s + j] - P [j])^{2} \cdot w_{T} [s + j] \cdot w_{P} [j] = 0$ where $w$ is $0$ at wildcard positions, $1$ otherwise.

Expand the squared term and compute each part via convolution (reversing $P$ ).

cpp

// Simplified: wildcard = 0 in both T and P
// Match score at position s:
//   sum_j T[s+j]^2 * P[j] - 2 * T[s+j] * P[j]^2 + P[j]^3 ... (expanded)
// Each sum is a convolution after reversing P.

Pattern 5: Bitwise Convolution (OR/AND/XOR)

Standard convolution computes $C [k] = \sum_{i + j = k} A [i] \cdot B [j]$ .

OR convolution: $C [k] = \sum_{i | j = k} A [i] \cdot B [j]$ Uses the sum-over-subsets (SOS) / Mobius transform instead of FFT.

XOR convolution: $C [k] = \sum_{i \oplus j = k} A [i] \cdot B [j]$ Uses the Walsh-Hadamard Transform (WHT).

cpp

// Walsh-Hadamard Transform (XOR convolution)
void wht(vector<long long>& a, bool invert) {
    int n = a.size();
    for (int len = 1; len < n; len <<= 1) {
        for (int i = 0; i < n; i += len << 1) {
            for (int j = 0; j < len; j++) {
                long long u = a[i + j];
                long long v = a[i + j + len];
                a[i + j]       = (u + v) % MOD;
                a[i + j + len] = (u - v + MOD) % MOD;
            }
        }
    }
    if (invert) {
        long long inv_n = inv(n, MOD);
        for (auto& x : a) x = x * inv_n % MOD;
    }
}

// OR convolution (subset-sum / zeta transform)
void or_transform(vector<long long>& a, bool invert) {
    int n = a.size();
    for (int len = 1; len < n; len <<= 1) {
        for (int i = 0; i < n; i += len << 1) {
            for (int j = 0; j < len; j++) {
                if (!invert)
                    a[i + j + len] = (a[i + j + len] + a[i + j]) % MOD;
                else
                    a[i + j + len] = (a[i + j + len] - a[i + j] + MOD) % MOD;
            }
        }
    }
}

// AND convolution (superset-sum / zeta transform)
void and_transform(vector<long long>& a, bool invert) {
    int n = a.size();
    for (int len = 1; len < n; len <<= 1) {
        for (int i = 0; i < n; i += len << 1) {
            for (int j = 0; j < len; j++) {
                if (!invert)
                    a[i + j] = (a[i + j] + a[i + j + len]) % MOD;
                else
                    a[i + j] = (a[i + j] - a[i + j + len] + MOD) % MOD;
            }
        }
    }
}

// Usage: XOR convolution of two arrays
vector<long long> xor_convolve(vector<long long> a, vector<long long> b) {
    int n = 1;
    while (n < max(a.size(), b.size())) n <<= 1;
    a.resize(n); b.resize(n);
    wht(a, false); wht(b, false);
    for (int i = 0; i < n; i++) a[i] = a[i] * b[i] % MOD;
    wht(a, true);
    return a;
}

Pattern 6: Power of a Polynomial / Repeated Squaring

When: You need $P (x)^{k}$ for large $k$ .

Combine polynomial multiplication with binary exponentiation. Cost: $O (n \log n \cdot \log k)$ where $n$ is the degree of the result.

cpp

vector<long long> poly_power(vector<long long> base, int k, int max_deg) {
    vector<long long> result = {1};
    while (k > 0) {
        if (k & 1) {
            result = multiply_ntt(result, base);
            if ((int)result.size() > max_deg + 1)
                result.resize(max_deg + 1);
        }
        base = multiply_ntt(base, base);
        if ((int)base.size() > max_deg + 1)
            base.resize(max_deg + 1);
        k >>= 1;
    }
    return result;
}

Complexity Analysis

Operation	Time	Space
FFT (forward or inverse)	$O (n \log n)$	$O (n)$
NTT (forward or inverse)	$O (n \log n)$	$O (n)$
Polynomial multiply	$O (n \log n)$	$O (n)$
WHT / OR / AND transform	$O (n \log n)$	$O (n)$
Arbitrary-mod multiply (split)	$O (n \log n)$	$O (n)$
Poly power via binary exp	$O (n \log n \log k)$	$O (n)$

Constant factors:

FFT (doubles): ~4-5x slower than NTT due to floating point and sin/cos.
NTT (mod 998244353): Fastest for modular problems.
The split trick for arbitrary modulus uses ~7 FFTs total.

When $n$ is the degree bound: Pad to $n' = $ next power of 2 $\geq 2 n$ . For two degree- $d$ polynomials, the product has degree $2 d$ , so pad to $n^{'} \geq 2 d + 1$ .

POLYNOMIAL SIZE -> FFT SIZE -> RESULT SIZE:

  Input:  A has |A| coefficients,  B has |B| coefficients
  Result: C = A*B has |A|+|B|-1 coefficients
  FFT needs: n_fft = next power of 2 >= |A|+|B|-1

  Example:  |A|=3,  |B|=4

  A:  [a₀  a₁  a₂]           3 coefficients
  B:  [b₀  b₁  b₂  b₃]      4 coefficients
  C:  [c₀  c₁  c₂  c₃  c₄  c₅]   = 3+4-1 = 6 coefficients

  FFT size = next_pow2(6) = 8

  Padded A:  [a₀  a₁  a₂  0   0   0   0   0 ]   (8 entries)
  Padded B:  [b₀  b₁  b₂  b₃  0   0   0   0 ]   (8 entries)
  FFT out:   [c₀  c₁  c₂  c₃  c₄  c₅  0   0 ]   (8 entries)
  Result:    [c₀  c₁  c₂  c₃  c₄  c₅]            (take first 6)

  Work done: 3 FFTs of size 8 = 3 x 8 x log₂(8) = 72 operations
  If we padded to 16 instead:  3 x 16 x 4 = 192 operations  (2.67x slower!)

ROOTS OF UNITY ON THE UNIT CIRCLE (n=8):

              ω¹ = e^(2πi/8)
               *
          ω²  ╱     ╲  ω⁰ = 1
            *         *
           ╱           ╲
     ω³  *───────────────*  ω⁷     <- the 8 points where
           ╲           ╱             we evaluate the polynomial
            *         *
          ω⁴  ╲     ╱  ω⁶
               *
              ω⁵

  Key properties:
  • ω⁸ = 1  (they're 8th roots of unity)
  • ωᵏ⁺⁴ = -ωᵏ  (half-circle symmetry -> butterfly trick)
  • Sum of all roots = 0  (cancellation lemma)
  • In NTT: replace unit circle with cyclic group mod p
    ω = g^((p-1)/n) mod p,  where g is a primitive root

RUNTIME SCALING -- WHY CONSTANTS MATTER IN O(n log n):

  FFT size (n)   log₂n   n*log₂n     Relative to 2¹⁷
  ──────────────────────────────────────────────────────
   2¹⁵ =  32K     15       491K       0.14x
   2¹⁶ =  64K     16      1049K       0.30x
   2¹⁷ = 128K     17      2228K       1.00x  <- baseline
   2¹⁸ = 256K     18      4718K       2.12x
   2¹⁹ = 512K     19      9961K       4.47x
   2²⁰ =   1M     20     20972K       9.41x
   2²¹ =   2M     21     44040K      19.77x
   2²² =   4M     22     92274K      41.42x
   2²³ =   8M     23    192937K      86.60x  <- 998244353 max

  Key takeaway: going from 2¹⁷ -> 2²⁰ is ~9x slower,
  not just "a little more padding." Every extra bit in
  the exponent roughly doubles the work.

  Each FFT multiply does 3 transforms (FFT, FFT, IFFT),
  so total work = 3 x n x log₂n butterfly operations.

MEMORY LAYOUT -- CACHE EFFECTS IN FFT:

  Small n (fits in L1/L2 cache):
  ┌──────────────────────────────────────┐
  │ All butterfly operations hit cache   │
  │ -> near-peak throughput               │
  └──────────────────────────────────────┘

  Large n (exceeds cache):
  ┌──────────────────────────────────────┐
  │ Later stages stride > cache line     │
  │ -> cache misses on every access       │
  │ -> 3-5x slower per operation          │
  └──────────────────────────────────────┘

  Practical consequence:
    n = 2¹⁸ (256K complex doubles = 4 MB) ~= L3 boundary
    n = 2²⁰ (1M complex doubles = 16 MB)  -> cache-hostile
    -> NTT (mod arithmetic, no doubles) uses half the memory

Edge Cases & Gotchas

Forgetting to pad to power of 2: The FFT/NTT size must be a power of 2 and at least $| A | + | B | - 1$ .
Precision with doubles: llround instead of (long long) to avoid off-by-one rounding errors. Coefficients above $\sim 10^{15}$ may lose precision.
Modular inverse existence: For NTT, $n$ (the FFT size) must be invertible mod $p$ . Since $p$ is prime and $n$ is a power of 2 that divides $p - 1$ , this is always satisfied for NTT-friendly primes.
Off-by-one in result size: Product of polynomials of sizes $a$ and $b$ has size $a + b - 1$ , not $a + b$ .
NTT overflow: Intermediate products $a [i] \cdot w n$ can overflow long long if both are close to $10^{18}$ . Use __int128 or ensure values are reduced mod $p$ before multiplication (they should be if done correctly).
Negative coefficients in NTT: Add $p$ before taking mod to avoid negative values: (u - v + MOD) % MOD.
Self-convolution double counting: When convolving a frequency array with itself to count pairs, $C [k]$ counts ordered pairs. For unordered pairs, handle $i = j$ and $i \neq j$ separately.
Reusing arrays between test cases: Clear or resize properly. A common bug is leftover data from a previous FFT call.

Mental Traps

"FFT is just an efficient algorithm. I don't need to understand the math." You can use it as a black box -- until you need to modify it. Handling non-power-of-2 sizes, choosing between FFT and NTT, understanding why precision fails at large coefficient values, recognizing that the XOR convolution needs WHT not FFT -- all of these require knowing what the algorithm actually does. The "just copy-paste the template" approach breaks at the first non-standard application.

"If I'm working modulo some prime, use NTT." Only if the prime is NTT-friendly ( $p = c \cdot 2^{k} + 1$ with large enough $k$ ). If the modulus is $10^{9} + 7$ (which is not NTT-friendly), you need either the split-coefficient FFT trick or a different approach. Reaching for NTT when the modulus doesn't support it gives wrong answers, not errors.

"Convolution and polynomial multiplication are different things." They're not. The convolution of arrays $a$ and $b$ at index $k$ is $\sum_{i + j = k} a [i] \cdot b [j]$ . This is exactly the coefficient of $x^{k}$ in the product polynomial $A (x) \cdot B (x)$ . Recognizing "convolution" problems as "polynomial multiplication" problems (and vice versa) doubles the situations where you know FFT/NTT applies.

"I need exact integer arithmetic, so doubles won't work." With llround, FFT on doubles gives exact integer results as long as the intermediate values don't exceed $\sim 10^{15}$ . For two arrays of length $n$ with values up to $v$ , the maximum coefficient in the product is $n \cdot v^{2}$ . If $n \cdot v^{2} < 10^{15}$ , doubles are fine. Otherwise, use NTT.

"I'll just make the FFT size much bigger to be safe." Doubling $n$ adds a full extra butterfly stage -- $O (n \log n)$ means the constant matters. Padding to $2^{20}$ when $2^{17}$ suffices costs ~3.5x more work. Worse, if $n$ exceeds the NTT prime's max supported transform size ( $2^{23}$ for $998244353$ ), the transform silently produces garbage because the required root of unity doesn't exist.

"The result polynomial has the same size as my input arrays." The product of polynomials of sizes $| A |$ and $| B |$ has exactly $| A | + | B | - 1$ coefficients. Reading only $max (| A |, | B |)$ entries misses high-degree terms; reading $| A | + | B |$ includes one garbage trailing zero.

POLYNOMIAL SIZE ARITHMETIC:

  A = [a₀, a₁, a₂]         |A| = 3  (degree 2)
  B = [b₀, b₁, b₂, b₃]    |B| = 4  (degree 3)

  Product C = A x B:
    degree(C) = 2 + 3 = 5
    |C| = 3 + 4 - 1 = 6    <- THIS is how many coefficients

  FFT size needed:
    next_power_of_2(6) = 8

  Common mistakes:
    X Reading |C| = max(3,4) = 4   -> misses c₄, c₅
    X Reading |C| = 3 + 4 = 7      -> c₆ is garbage (always 0, but wasteful)
    X FFT size = max(3,4) = 4      -> aliasing! Results wrap around
    OK Reading |C| = 3 + 4 - 1 = 6  -> correct
    OK FFT size = 8                  -> correct (next power of 2 >= 6)

"I'll just increase the FFT size to be safe." Doubling the FFT size doubles the work ( $O (n \log n)$ is not $O (n)$ ). Going from $2^{17}$ to $2^{18}$ adds ~260K extra operations -- noticeable in tight time limits. Worse, if you're using NTT, exceeding the prime's supported transform size silently produces garbage. For $998244353 = 119 \times 2^{23} + 1$ , the maximum NTT size is $2^{23} = 8,388,608$ . Padding to $2^{24}$ doesn't just waste time -- it gives wrong answers because the required primitive root of order $2^{24}$ doesn't exist mod this prime.

"The result array has the same size as the input." The product of a degree- $(a - 1)$ and a degree- $(b - 1)$ polynomial has degree $a + b - 2$ , so $| C | = | A | + | B | - 1$ coefficients. Reading only $max (| A |, | B |)$ entries silently drops high-order terms. Reading all $n_{fft}$ entries includes trailing zeros (or floating-point noise in FFT). Always resize the result to exactly $| A | + | B | - 1$ .

SIZE RELATIONSHIPS IN POLYNOMIAL MULTIPLICATION:

  A has |A| coefficients (degree |A|-1)
  B has |B| coefficients (degree |B|-1)
  C = A*B has |A|+|B|-1 coefficients

  Example: A = [1, 2, 3]  (|A|=3)
           B = [4, 5, 6, 7]  (|B|=4)

  Product size = 3 + 4 - 1 = 6 coefficients
  FFT size     = next power of 2 >= 6 = 8

  ┌───┬───┬───┬───┬───┬───┬───┬───┐
  │ c₀│ c₁│ c₂│ c₃│ c₄│ c₅│ 0 │ 0 │  <- FFT output (size 8)
  └───┴───┴───┴───┴───┴───┴───┴───┘
  ├─── result: take first 6 ───┤ ├junk┤

  Common bugs:
  X result.resize(max(|A|,|B|)) = 4  -> loses c₄, c₅
  X result.resize(n_fft) = 8         -> includes garbage zeros
  OK result.resize(|A|+|B|-1) = 6     -> correct

Trap: "Self-convolution counts each pair once."

Self-convolution of a frequency array counts ordered pairs. If freq = [0, 2, 1] (two 1s, one 2), then conv[2] = freq[1]*freq[1] = 4, which counts pairs (1,1), (1,1), (1,1), (1,1) -- all four ordered selections of two elements from {1, 1}. For unordered pairs, you need to subtract the $i = j$ diagonal and divide by 2. Forgetting this adjustment is a classic source of WA.

SELF-CONVOLUTION PAIR COUNTING:

  Array: [1, 1, 2]    freq[1]=2, freq[2]=1

  Raw convolution result for sum k:
    conv[2] = freq[0]*freq[2] + freq[1]*freq[1] + freq[2]*freq[0]
            = 0*1 + 2*2 + 1*0 = 4    <- ORDERED pairs summing to 2

  Ordered pairs summing to 2: (1,1) x 4 ways
    (elem 0 + elem 1), (elem 1 + elem 0),
    (elem 0 + elem 0), (elem 1 + elem 1)  <- includes i=j!

  Adjustments for UNORDERED DISTINCT pairs:
    1. Subtract i=j: conv[2k] -= freq[k]  (for each k)
    2. Divide by 2:  conv[s] /= 2

  ┌──────────────────────────────────────────────┐
  │ Raw conv = ordered pairs (including i=j)     │
  │ Subtract diagonal -> ordered pairs (i!=j)      │
  │ Divide by 2 -> unordered pairs (i!=j)          │
  └──────────────────────────────────────────────┘

Trap: "I can do NTT with any prime, as long as I have a primitive root."

You need a primitive root of order exactly $n$ (your FFT size), meaning $n ∣ (p - 1)$ . Having a primitive root modulo $p$ (order $p - 1$ ) isn't enough -- you need $p - 1$ to be divisible by $n$ . For $p = 10^{9} + 7$ , since $p - 1 = 2 \times 500000003$ , the only power-of-2 NTT size supported is $n = 2$ . That's why 998244353 ( $= 119 \times 2^{23} + 1$ ) is the standard choice.

NTT PRIME COMPATIBILITY CHECK:

  For NTT of size n (must be power of 2):
    REQUIREMENT: n divides (p - 1)

  p = 998244353:  p-1 = 998244352 = 119 x 2²³
    Max n = 2²³ = 8,388,608  OK (plenty for contests)

  p = 10⁹ + 7:   p-1 = 10⁹ + 6 = 2 x 500000003
    Max n = 2¹ = 2            X (useless!)

  p = 469762049:  p-1 = 469762048 = 7 x 2²⁶
    Max n = 2²⁶ = 67,108,864  OK (even bigger!)

  ┌───────────────────────────────────────────────┐
  │ Before using NTT, ALWAYS verify:              │
  │   1. p is prime                               │
  │   2. Find largest power of 2 dividing p-1     │
  │   3. That power >= your required FFT size      │
  │   4. Know the primitive root (usually g=3)    │
  └───────────────────────────────────────────────┘

Debugging Checklist

When your FFT/NTT solution gives wrong answers:

[ ] Is n (FFT size) a power of 2 and >= sum of input sizes?
[ ] Did you resize BOTH input arrays to n before FFT?
[ ] Is the result array read up to the correct index (a.size() + b.size() - 2)?
[ ] For FFT: are you using llround(), not (long long) cast?
[ ] For NTT: is the modulus NTT-friendly? Does 2^k divide (p-1)?
[ ] For NTT: is the primitive root correct for your modulus?
[ ] Is the inverse FFT dividing by n (or multiplying by n^{-1} mod p)?
[ ] For convolution: did you reverse the correct array?
[ ] Are intermediate multiplications overflowing?
[ ] For multiple test cases: are you clearing/resizing properly?

Quick sanity test: Multiply $[1, 1] \times [1, 1]$ . Expected: $[1, 2, 1]$ (since $(1 + x) (1 + x) = 1 + 2 x + x^{2}$ ).

cpp

// Debug helper: test basic multiplication
void test_fft() {
    vector<int> a = {1, 1}, b = {1, 1};
    auto c = multiply(a, b);
    // c should be [1, 2, 1]
    assert(c[0] == 1 && c[1] == 2 && c[2] == 1);
}

Worked Examples

Example 0: Step-by-Step FFT Multiply -- $(1 + 2 x) \times (3 + x)$

Goal: Compute $(1 + 2 x) (3 + x) = 3 + 7 x + 2 x^{2}$ using the FFT pipeline.

$A = [1, 2]$ (coefficients of $1 + 2 x$ ), $B = [3, 1]$ (coefficients of $3 + x$ ). Product has $2 + 2 - 1 = 3$ coefficients, so pad to $n = 4$ (next power of 2).

STEP 0: PAD TO n=4
  A = [1, 2, 0, 0]     B = [3, 1, 0, 0]

STEP 1: EVALUATE (forward FFT at 4th roots of unity)
  The 4th roots of unity: w_4^0=1, w_4^1=i, w_4^2=-1, w_4^3=-i

  A(w^0) = 1 + 2(1)  + 0 + 0 = 3
  A(w^1) = 1 + 2(i)  + 0 + 0 = 1 + 2i
  A(w^2) = 1 + 2(-1) + 0 + 0 = -1
  A(w^3) = 1 + 2(-i) + 0 + 0 = 1 - 2i

  B(w^0) = 3 + 1(1)  + 0 + 0 = 4
  B(w^1) = 3 + 1(i)  + 0 + 0 = 3 + i
  B(w^2) = 3 + 1(-1) + 0 + 0 = 2
  B(w^3) = 3 + 1(-i) + 0 + 0 = 3 - i

  A^ = [3,  1+2i,  -1,  1-2i]
  B^ = [4,  3+i,    2,  3-i ]

STEP 2: POINTWISE MULTIPLY
  C^[0] = 3 * 4          = 12
  C^[1] = (1+2i)(3+i)    = 3+i+6i+2i^2 = 3+7i-2 = 1+7i
  C^[2] = (-1) * 2       = -2
  C^[3] = (1-2i)(3-i)    = 3-i-6i+2i^2 = 3-7i-2 = 1-7i

  C^ = [12,  1+7i,  -2,  1-7i]

STEP 3: INTERPOLATE (inverse FFT = evaluate at w^{-1}, divide by n)
  Use IFFT formula: c[k] = (1/n) * sum_{j=0}^{n-1} C^[j] * w^{-jk}

  c[0] = (12 + (1+7i) + (-2) + (1-7i)) / 4 = 12/4 = 3
  c[1] = (12 + (1+7i)(w^{-1}) + (-2)(w^{-2}) + (1-7i)(w^{-3})) / 4
       = (12 + (1+7i)(-i) + (-2)(-1) + (1-7i)(i)) / 4
       = (12 + (-i+7) + 2 + (i+7)) / 4 = 28/4 = 7
  c[2] = (12 + (1+7i)(-1) + (-2)(1) + (1-7i)(-1)) / 4
       = (12 - 1 - 7i - 2 - 1 + 7i) / 4 = 8/4 = 2
  c[3] = 0  (as expected -- no x^3 term)

RESULT:  C = [3, 7, 2]  -->  3 + 7x + 2x^2   OK

Example 1: Big Number Multiplication

Problem: Multiply two large numbers represented as digit strings.

Approach: Treat each number as a polynomial where coefficient $i$ is the $i$ -th digit (from least significant). Multiply via FFT. Propagate carries.

cpp

string big_multiply(string& s1, string& s2) {
    int n1 = s1.size(), n2 = s2.size();
    vector<int> a(n1), b(n2);
    // Reverse so index 0 = least significant digit
    for (int i = 0; i < n1; i++) a[i] = s1[n1 - 1 - i] - '0';
    for (int i = 0; i < n2; i++) b[i] = s2[n2 - 1 - i] - '0';

    auto c = multiply(a, b);

    // Propagate carries
    string res;
    long long carry = 0;
    for (int i = 0; i < (int)c.size() || carry; i++) {
        long long cur = carry + (i < (int)c.size() ? c[i] : 0);
        res += (char)('0' + cur % 10);
        carry = cur / 10;
    }
    // Remove leading zeros
    while (res.size() > 1 && res.back() == '0') res.pop_back();
    reverse(res.begin(), res.end());
    return res;
}

Example 2: Counting Subset Sums

Problem: Given $n$ items with values $v_{1}, \dots, v_{n}$ , count the number of subsets whose values sum to each possible total $S$ ( $0 \leq S \leq V$ ), where $V = \sum v_{i}$ .

Approach: The generating function for item $i$ is $(1 + x^{v_{i}})$ . The answer is the product of all these polynomials. Use divide-and-conquer polynomial multiplication to avoid $O (n)$ sequential multiplications.

cpp

// D&C polynomial multiplication of (1 + x^{v[0]}) * ... * (1 + x^{v[n-1]})
vector<long long> dc_multiply(vector<vector<long long>>& polys, int l, int r) {
    if (l == r) return polys[l];
    int mid = (l + r) / 2;
    auto left = dc_multiply(polys, l, mid);
    auto right = dc_multiply(polys, mid + 1, r);
    return multiply_ntt(left, right);
}

vector<long long> count_subset_sums(vector<int>& values) {
    vector<vector<long long>> polys;
    for (int v : values) {
        vector<long long> p(v + 1, 0);
        p[0] = 1;
        p[v] = 1;
        polys.push_back(p);
    }
    return dc_multiply(polys, 0, (int)polys.size() - 1);
}

Example 3: Codeforces-Style Problem

Problem (CF 954I style): Given a binary text $T$ and pattern $P$ (with wildcards ?), find all positions in $T$ where $P$ matches.

Approach: Define $t [i]$ , $p [j]$ as the character values ( $0$ / $1$ , with wildcard $= 0$ ). A match at position $s$ exists iff: $\sum_{j} (t [s + j] - p [j])^{2} \cdot p [j] \cdot t [s + j] = 0$ (wildcards zeroed).

Reverse $p$ and this becomes a convolution.

cpp

vector<int> wildcard_match(string& text, string& pat) {
    int n = text.size(), m = pat.size();

    // Encode: '1' -> 1, '0' -> -1 (or keep as 0/1), '?' -> 0
    // For simplicity, use the direct squared-difference approach
    // with 3 convolutions.

    // t[i], p[j]: character value (0 for wildcard)
    // Match at s: sum_j p[j]^3 * t[s+j] - 2*p[j]^2*t[s+j]^2 + p[j]*t[s+j]^3
    // = sum_j t[s+j]*p[j]*(t[s+j]-p[j])^2

    // Reverse p for convolution
    // ... (3 convolutions, check if result[s] == 0)

    // Placeholder for full implementation:
    vector<int> match_positions;
    // (see pattern 4 above for the full technique)
    return match_positions;
}

Self-Test

Without looking at the reference material, you should be able to:

Write the iterative FFT from memory -- including bit-reversal permutation, butterfly loop, and the invert flag for inverse FFT.
Write the multiply wrapper -- computing the required FFT size, padding both arrays, forward FFT, pointwise multiply, inverse FFT, and rounding results.
State when to use FFT vs. NTT -- and which moduli are NTT-friendly. Why is $998244353$ special? Why is $10^{9} + 7$ problematic?
Recognize a frequency-array convolution problem -- "count pairs summing to $k$ " -> build freq array -> self-convolve.
State the WHT for XOR convolution -- at least at the level of "it's a different transform, same structure."
Explain the bit-reversal permutation -- why does the iterative FFT reorder inputs this way, and what happens if you skip it?
Trace through a small FFT by hand -- e.g., multiply $[1, 1] \times [1, 1]$ through the full evaluate -> pointwise multiply -> interpolate pipeline and verify the result is $[1, 2, 1]$ .

[ ] Given arrays of sizes 5 and 7, compute the exact FFT size needed and the exact result array size. (Answer: result size = 11, FFT size = 16.)
[ ] Compute $g^{(p - 1) / 8} mod p$ for $p = 998244353$ , $g = 3$ to find the primitive 8th root of unity used in an NTT of size 8.
[ ] Manually verify that llround gives correct results for multiplying $[10^{5}, 10^{5}] \times [10^{5}, 10^{5}]$ via FFT on doubles. What is the largest coefficient, and is it below $10^{15}$ ?
[ ] Write a modular-arithmetic multiply wrapper for NTT that handles: (a) computing FFT size, (b) padding both arrays, (c) forward NTT, (d) pointwise multiply mod $p$ , (e) inverse NTT with $n^{- 1}$ scaling, (f) trimming the result to $| A | + | B | - 1$ .
[ ] Given a problem "count the number of ways to pick two elements from an array with sum $k$ ," write the full solution using frequency-array self-convolution, correctly handling the $i = j$ and ordered-vs-unordered pair adjustments.
[ ] Given $| A | = 500000$ and $| B | = 300000$ , compute the required FFT size ( $2^{20} = 1048576 \geq 799999$ ) and verify it fits within 998244353's max NTT size ( $2^{23}$ ).
[ ] Compute the primitive $8$ -th root of unity modulo $998244353$ : $g^{(p - 1) / 8} = 3^{(998244353 - 1) / 8} mod 998244353$ . Verify it has order exactly 8.
[ ] For arrays with values up to $10^{6}$ and length $10^{5}$ , compute the max coefficient ( $n \cdot v^{2} = 10^{17}$ ) and decide: FFT-doubles or NTT? (Answer: NTT -- exceeds $10^{15}$ precision limit.)
[ ] Write an NTT-based multiply function that works mod 998244353, test it on $[1, 1] \times [1, 1, 1]$ -> $[1, 2, 2, 1]$ .
[ ] Given a problem "count unordered pairs $(i, j)$ with $a_{i} + a_{j} = k$ ", write the frequency-array convolution and handle the $i = j$ double-counting correction.
[ ] Multiply $[1, 2, 3] \times [4, 5]$ by hand through the full FFT pipeline: pad to size 4, compute 4th roots of unity, evaluate both polynomials, pointwise multiply, inverse FFT, and verify $[4, 13, 22, 15]$ .
[ ] Given a frequency array freq = [0, 3, 0, 2, 0, 1] (three 1s, two 3s, one 5), compute the self-convolution entry for sum $k = 4$ and explain the ordered-vs-unordered pair adjustment.
[ ] Write the bit-reversal permutation for $n = 16$ -- list all 16 indices and their bit-reversed counterparts in a table.
[ ] Explain why padding to $2^{18}$ when $2^{17}$ suffices roughly doubles the FFT runtime -- relate this to the $O (n \log n)$ formula and compute the actual operation counts.
[ ] For an NTT mod 998244353, compute the primitive 4th root of unity: $3^{(998244353 - 1) / 4} mod 998244353$ . Verify it has order exactly 4 by checking that its square equals $- 1 (\mod p)$ .

Practice Problems

#	Problem	Key Technique	Difficulty
1	SPOJ POLYMUL	Direct polynomial multiplication	CF ~1600
2	CF 632E - Thief in a Shop	Generating functions + poly power	CF 2000
3	CF 993E - Nikita and Order Statistics	Counting via convolution	CF 1900
4	CF 954I - Yet Another String Matching	String matching via FFT	CF 2200
5	CF 958F3 - Lightsabers (hard)	NTT + generating functions	CF 2100
6	CF 623E - Transforming Sequence	NTT + binary lifting on polys	CF 2600
7	Atcoder ATC001 C - Fast Fourier Transform	Basic FFT practice	CF ~1700
8	CF 528D - Fuzzy Search	FFT for approximate matching	CF 2000
9	CF 1096G - Lucky Tickets	NTT on frequency arrays	CF 2000
10	CF 300D - Painting Squares	FFT + combinatorics	CF 2100
11	Apples and Bananas	Convolution of two frequency arrays	CSES 2000
12	One Bit Positions	Bitstring autocorrelation via FFT	CSES 2000
13	Signal Processing	Cross-correlation of two signals	CSES 2000

Suggested progression:

Start with SPOJ POLYMUL and Atcoder ATC001 C to verify your template.
Move to CF 993E and 1096G for standard NTT applications.
Try CF 632E for polynomial exponentiation.
Tackle CF 528D and 954I for convolution-based string matching.
Attempt CF 623E for a challenging NTT + D&C combination.

Summary

FFT / NTT Cheat Sheet
=====================

WHEN:   Polynomial multiplication, convolution, "count sums", n >= 10^5
TIME:   O(n log n) per multiply
SPACE:  O(n)

FFT (doubles):
  + Works with any modulus (or no modulus)
  - Precision issues for large coefficients (> ~10^14)
  - Slower constant factor

NTT (mod p):
  + Exact, fast, no precision issues
  - Only works modulo NTT-friendly primes
  - 998244353 is the standard contest prime

RECIPE:
  1. Pad both arrays to next power of 2 >= |a| + |b| - 1
  2. Forward FFT/NTT on both
  3. Pointwise multiply
  4. Inverse FFT/NTT
  5. (FFT) Round to nearest integer / (NTT) done

BITWISE VARIANTS:
  XOR convolution -> Walsh-Hadamard Transform
  OR  convolution -> Sum-over-subsets (zeta) transform
  AND convolution -> Superset-sum transform
  All O(n log n), same butterfly structure.

ARBITRARY MODULUS:
  Split coefficients into 15-bit halves, use 3-4 FFTs on doubles.

The Aha Moment

Multiplying two polynomials of degree $n$ takes $O (n^{2})$ in coefficient form -- every term of one must multiply every term of the other. But in point-value form, polynomial multiplication is just element-wise multiplication: $O (n)$ . FFT converts between these two representations in $O (n \log n)$ : evaluate at $n$ special points (roots of unity), multiply pointwise, then interpolate back. The trick is choosing evaluation points that make the DFT matrix decompose recursively -- that's where the roots of unity come in.

NTT is the same idea, but over $Z / p Z$ instead of $C$ -- using primitive roots modulo a prime (like $998244353 = 119 \cdot 2^{23} + 1$ ) as "roots of unity." No floating-point errors, exact modular arithmetic.

Pattern Fingerprint

Constraint / Signal	Technique
"Multiply two polynomials" or "convolve two sequences"	FFT / NTT
$n \leq 10^{6}$ and the naive $O (n^{2})$ convolution TLEs	FFT / NTT gives $O (n \log n)$
"Count pairs $(i, j)$ with $a_{i} + a_{j} = k$ for all $k$ "	Convolution of frequency arrays
"Multiply large numbers" ( $10^{50000}$ digits)	Treat digits as polynomial coefficients, FFT-multiply
Answer must be modulo $998244353$	NTT -- this prime has $2^{23}$ as a factor of $p - 1$
Answer modulo arbitrary prime (e.g., $10^{9} + 7$ )	Split coefficients + multiple NTTs, or CRT with NTT-friendly primes
"Partition function" / "subset sum count" / generating functions	Polynomial operations via FFT
Divide-and-conquer optimization with convolution	Online FFT / divide-and-conquer FFT

Rating Progression

CF Rating	What You Should Be Able To Do
1200	Know that polynomial multiplication can be done faster than $O (n^{2})$ ; not expected to implement
1500	Recognize problems reducible to convolution (frequency array multiplication); use library FFT
1800	Implement FFT/NTT from scratch; handle size padding to powers of 2; know when to use NTT vs FFT
2100	Polynomial division, modular composition, online convolution; combine FFT with generating functions; handle arbitrary modulus via CRT or splitting

Before You Code Checklist

[ ] Pad both polynomials to the same size = next power of 2 $\geq n_{1} + n_{2} - 1$ -- the result polynomial has degree $n_{1} + n_{2} - 2$ , so you need that many points
[ ] Choose FFT (doubles) or NTT (modular) -- FFT is simpler but has precision issues for $n > 10^{5}$ with large coefficients; NTT is exact but requires a suitable prime
[ ] For NTT: verify the prime supports the needed transform size -- $998244353$ supports up to $2^{23}$ ; for larger, use a different prime or CRT
[ ] Bit-reverse the array before the butterfly -- or use the iterative FFT with bit-reversal permutation
[ ] After inverse FFT: divide by $n$ -- the inverse transform includes a $1 / n$ factor; forgetting it gives answers scaled by $n$

What Would You Do If...?

Scenario 1: "The problem needs convolution modulo $10^{9} + 7$ , not $998244353$ ." $10^{9} + 7 - 1 = 2 \times 5 \times 10^{8}$ -- not enough factors of 2 for NTT at large sizes. Options: (1) Use FFT with doubles if $n \leq 10^{5}$ and coefficients $\leq 10^{9}$ (careful with precision). (2) Split coefficients into halves (each $\leq \sqrt{M O D}$ ), do 3-4 FFTs on doubles. (3) Use CRT: run NTT with 2-3 NTT-friendly primes, then combine results with the Chinese Remainder Theorem.

Scenario 2: "You need to multiply polynomials of degree $10^{6}$ with coefficients up to $10^{9}$ , and the answer must be exact." Intermediate values in convolution can be as large as $n \cdot C^{2} \approx 10^{6} \cdot 10^{18} = 10^{24}$ , which far exceeds double precision ( $\sim 2^{53} \approx 9 \times 10^{15}$ ). You MUST use NTT, or FFT with coefficient splitting. Using plain double FFT will give wrong answers.

Scenario 3: "The problem reduces to multiplying many polynomials $P_{1} \cdot P_{2} \dots P_{n}$ ." Don't multiply left to right -- that's $O (n^{2} \log n)$ in the worst case. Use a priority queue (merge the two smallest polynomials first, like Huffman coding) for $O (n \log^{2} n)$ total. Alternatively, divide-and-conquer: multiply $P_{1} \dots P_{n / 2}$ and $P_{n / 2 + 1} \dots P_{n}$ recursively.

The Mistake That Teaches

Story: You implement NTT for the first time. You use mod $998244353$ , primitive root $g = 3$ . Sample tests pass. You submit a problem requiring the convolution of two arrays of size $5 \times 10^{5}$ -- Wrong Answer.

You double-check your NTT implementation: butterfly structure looks right, forward and inverse transforms are consistent. You test: multiply $[1, 1]$ by $[1, 1]$ -- get $[1, 2, 1]$ OK. Multiply two random arrays of size 100 -- matches brute force OK.

Then you notice: your padded size $n$ is $2^{20} = 1048576$ , and the inverse NTT divides by $n$ using inv_n = power(n, MOD-2). You print inv_n... it looks right. But wait -- you're computing power(n, MOD-2, MOD) where n is an int, and $2^{20}$ fits in an int. That's fine. Except... your power function takes long long arguments but inside you wrote base *= base without % MOD. For the forward NTT, the root of unity w = power(g, (MOD-1)/n) -- you check: (MOD-1)/n for n = 1048576 gives (998244352) / 1048576 = 952. So w = power(3, 952, MOD). You print w... it's correct.

After two hours, you find it: in the butterfly loop, you wrote t = a[j + half] * w % MOD -- but a[j + half] * w overflows long long because both are up to $10^{9}$ , and $10^{9} \times 10^{9} = 10^{18}$ , which DOES fit in long long ( $< 9.2 \times 10^{18}$ ). So that's not the issue either.

Finally, you spot it: you forgot to take % MOD after a[j] + t. The sum of two values each up to $10^{9}$ is $2 \times 10^{9}$ , which overflows... no, it fits in long long. But it DOESN'T stay in $[0, MOD)$ , so subsequent multiplications CAN overflow: $(2 \times 10^{9})^{2} > 2^{63}$ .

The fix: Always % MOD after BOTH addition AND subtraction in the butterfly:

cpp

long long t = a[j + half] % MOD * w % MOD;
a[j + half] = (a[j] - t + MOD) % MOD;
a[j] = (a[j] + t) % MOD;

Lesson: In modular NTT, take % MOD after every single arithmetic operation. Values that "fit in long long" today will overflow when multiplied tomorrow.

Debug This

Bug 1:

Find the bug in this FFT implementation

cpp

void fft(vector<complex<double>>& a, bool invert) {
    int n = a.size();
    // bit-reversal permutation
    for (int i = 1, j = 0; i < n; i++) {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) swap(a[i], a[j]);
    }
    // butterfly
    for (int len = 2; len <= n; len <<= 1) {
        double ang = 2 * M_PI / len * (invert ? -1 : 1);
        complex<double> wlen(cos(ang), sin(ang));
        for (int i = 0; i < n; i += len) {
            complex<double> w(1);
            for (int j = 0; j < len; j++) {  // BUG HERE
                complex<double> u = a[i+j], v = a[i+j+len/2] * w;
                a[i+j] = u + v;
                a[i+j+len/2] = u - v;
                w *= wlen;
            }
        }
    }
    if (invert) for (auto& x : a) x /= n;
}

Bug: The inner loop runs j from 0 to len - 1, but it should only go to len/2 - 1. When j >= len/2, the index i + j + len/2 goes out of the current block (accesses i + len and beyond). This reads/writes wrong elements.

Fix: Change j < len to j < len / 2.

Bug 2:

Find the bug in this NTT-based polynomial multiplication

cpp

vector<long long> multiply(vector<long long>& a, vector<long long>& b) {
    int result_size = a.size() + b.size() - 1;
    int n = 1;
    while (n < result_size) n <<= 1;
    a.resize(n); b.resize(n);

    ntt(a, false);  // forward NTT
    ntt(b, false);
    for (int i = 0; i < n; i++)
        a[i] = a[i] * b[i] % MOD;
    ntt(a, true);   // inverse NTT

    a.resize(result_size);
    return a;
}

Bug: The function modifies a and b in place (resizing them with zero-padding). If the caller reuses a or b after this call, they'll find their original data destroyed and padded with zeros. Also, a[i] * b[i] may overflow if both are close to MOD ( $\sim 10^{9}$ ).

Fix: Work on copies:

cpp

vector<long long> fa(a.begin(), a.end()), fb(b.begin(), b.end());
fa.resize(n); fb.resize(n);

And for the multiplication: fa[i] = fa[i] % MOD * fb[i] % MOD;

Bug 3:

Find the bug in this convolution for counting pair sums

cpp

// Count number of pairs (i,j) with a[i] + a[j] = k, for each k
// a[i] in [0, 100000]
vector<long long> freq(100001, 0);
for (int x : a) freq[x]++;

// Convolve freq with itself
auto result = multiply(freq, freq);

// result[k] = number of ordered pairs with a[i] + a[j] = k
// But we want unordered pairs, and exclude i == j
for (int k = 0; k < result.size(); k++) {
    result[k] -= freq[k/2];       // subtract pairs where i == j ???
    result[k] /= 2;               // unordered pairs
}

Bug: The "subtract diagonal" logic is wrong. A pair $(i, i)$ contributes to result[2 * a[i]], not result[a[i]]. To remove self-pairs, subtract freq[x] from result[2*x] for each $x$ . The current code subtracts freq[k/2] from result[k] for ALL $k$ , which is wrong for odd $k$ (there are no self-pairs summing to odd values when pairing identical elements) and wrong in magnitude.

Fix:

cpp

for (int x = 0; x <= 100000; x++)
    result[2 * x] -= freq[x];     // remove i == j pairs
for (int k = 0; k < result.size(); k++)
    result[k] /= 2;               // unordered

Common Mistakes

Not padding to next power of 2. FFT requires array sizes to be powers of 2. The result array size must be at least deg(A) + deg(B) + 1, rounded up.
Precision loss with FFT over doubles. For large coefficients or many terms, use NTT (mod prime) or split into high/low halves.
Wrong NTT prime or generator. The prime must be of the form p = c * 2^k + 1 where 2^k >= n. Standard: 998244353 = 119 * 2^23 + 1, generator 3.
Forgetting to divide by n in inverse FFT. The IFFT step requires dividing each coefficient by n (or by n mod p for NTT).
Self-convolution diagonal error. When counting unordered pairs (i,j) with a[i]+a[j]=k, subtract self-pairs from result[2*x] for each x, then divide by 2.

Historical Origin

The Fast Fourier Transform was published by James Cooley and John Tukey in 1965, revolutionizing digital signal processing. However, historians later discovered that Carl Friedrich Gauss had described essentially the same algorithm around 1805 for interpolating asteroid orbits -- predating Cooley-Tukey by 160 years. The Number Theoretic Transform (NTT) was developed in the 1970s as the modular-arithmetic analog, avoiding floating-point entirely.

Mnemonic Anchor

"Evaluate, multiply pointwise, interpolate -- $O (n \log n)$ beats $O (n^{2})$ by splitting at roots of unity."

Recap and Cross-Links

Concept	Link
Math Fundamentals	01-math-fundamentals
Number Theory	02-number-theory
Divide and Conquer	14-divide-and-conquer
Practice	Ladder - Mixed

FFT & NTT (Polynomial Multiplication) ​

Contents ​

FFT vs NTT — pick before you implement ​

Intuition ​

The Pain ​

The Key Insight ​

Visual Walkthrough ​

The Invariant ​

When to Reach for This ​

Insight Gaps ​

What the Code Won't Teach You ​

How It Works (Detailed Mechanics) ​

Roots of Unity Primer ​

Cooley-Tukey (Recursive FFT) ​

Iterative FFT (Bit-Reversal) ​

NTT: Modular Arithmetic Version ​

Implementation ​

Recursive FFT (Educational) ​

Iterative FFT (Contest-Ready, Doubles) ​

NTT (Contest-Ready, mod 998244353) ​

Convolution Wrapper (Frequency Array) ​

Arbitrary Modulus Convolution (Split Trick) ​

Patterns & Applications ​

Pattern 1: Direct Polynomial Multiplication ​

Pattern 2: Counting with Generating Functions ​

Pattern 3: Frequency Convolution (Pair Counting) ​

Pattern 4: String Matching via Convolution ​

Pattern 5: Bitwise Convolution (OR/AND/XOR) ​

Pattern 6: Power of a Polynomial / Repeated Squaring ​

Complexity Analysis ​

Edge Cases & Gotchas ​

Mental Traps ​

Debugging Checklist ​

Worked Examples ​

Example 0: Step-by-Step FFT Multiply -- (1+2x)×(3+x) ​

Example 1: Big Number Multiplication ​

Example 2: Counting Subset Sums ​

Example 3: Codeforces-Style Problem ​

Self-Test ​

Practice Problems ​

Summary ​

The Aha Moment ​

Pattern Fingerprint ​

Rating Progression ​

Before You Code Checklist ​

What Would You Do If...? ​

The Mistake That Teaches ​

Debug This ​

Common Mistakes ​

Historical Origin ​

Mnemonic Anchor ​

Recap and Cross-Links ​