Skip to content

FFT & NTT (Polynomial Multiplication)

Quick Revisit

  • USE WHEN: polynomial multiplication and sum convolution (counting pairs with sum = k). For XOR convolution use FWHT (Walsh-Hadamard transform), not FFT.
  • INVARIANT: evaluate at n points, multiply pointwise, interpolate back -- convolution theorem
  • TIME: O(n log n) | SPACE: O(n)
  • CLASSIC BUG: array size not padded to next power of 2 -- causes out-of-bounds or wrong results
  • PRACTICE: ../12-practice-ladders/08-ladder-mixed.md

Difficulty: [Advanced] CF 1900-2200+ Prerequisites: Math Fundamentals, Number Theory, Divide and ConquerMotivation: When you need to multiply polynomials or compute convolutions in O(nlogn) instead of O(n2).

Contest Frequency: [*] Specialized -- appears in polynomial/convolution problems


Contents

FFT vs NTT — pick before you implement

The two transforms share the same convolution-theorem skeleton but differ in arithmetic:

FFTNTT
FieldComplex numbers (double / long double)Z/pZ for a special prime p
"Roots of unity"e2πik/n on the unit circleA primitive n-th root of unity mod p (e.g. 3 mod 998244353)
Exact?No — accumulates floating-point rounding errorYes — every operation is an integer mod p
RiskLoss of precision when coefficients × n is largeModulus must support n as a divisor of p1; works only for "NTT-friendly" primes
Use whenReal-valued / complex coefficients, or output fits comfortably in double precisionAnswer is required modulo a specific prime (typically 998244353=119223+1)

Decision rule: if the problem asks for the answer modulo a fixed NTT-friendly prime (especially 998244353 or 109+7 via three-prime NTT), reach for NTT. If the coefficients are small reals or the modulus is arbitrary and you can afford the precision loss, FFT is fine. Mixing them inside one routine is rarely worth it for contest scope.


Intuition

The Pain

Multiply A(x)=1+2x+3x2 and B(x)=4+5x.

The naive approach: every coefficient of A multiplies every coefficient of B.

          1   2   3       (coefficients of A)
      x   4   5           (coefficients of B)
      -----------
          5  10  15       (multiply by 5, shift by 1)
      +   4   8  12       (multiply by 4, shift by 0)
      -----------
          4  13  22  15   => C(x) = 4 + 13x + 22x^2 + 15x^3

This is exactly like long multiplication. For degree-n polynomials, that is O(n2) multiplications. When n=105 or 106, this is far too slow.

The wall: Any approach that touches every pair of coefficients is O(n2).

The Key Insight

"Evaluate both polynomials at n special points, multiply pointwise in O(n), then interpolate back -- the magic is that roots of unity make evaluation and interpolation O(nlogn) via divide and conquer."

Why does this work? A polynomial of degree n1 is uniquely determined by its values at n distinct points. So the workflow is:

Coefficient repr.          Point-value repr.          Coefficient repr.
   A = [a0..an-1]  ---->  A(w0), A(w1), ..., A(wn-1)
   B = [b0..bn-1]  ---->  B(w0), B(w1), ..., B(wn-1)
                                    |
                           multiply pointwise O(n)
                                    |
                           C(wi) = A(wi) * B(wi)
                                    |
   C = [c0..c2n-2] <----  interpolate back (inverse FFT)

The "special points" are the n-th roots of unity: wk=e2πik/n for k=0,1,,n1. Their algebraic structure -- specifically the halving property wn2k=wn/2k -- enables the divide and conquer split.

Visual Walkthrough

Step 1: Evaluate (FFT forward)

Split polynomial into even and odd indexed coefficients: A(x)=Aeven(x2)+xAodd(x2)

For A(x)=a0+a1x+a2x2+a3x3:

  • Aeven(y)=a0+a2y
  • Aodd(y)=a1+a3y

Then A(wk)=Aeven(wk2)+wkAodd(wk2).

Since wn2 cycles through the (n/2)-th roots of unity, we recurse on half the size!

Butterfly diagram (n=8, showing one level):

Input:  a0  a1  a2  a3  a4  a5  a6  a7
        |       |       |       |
        +--+-+--+  split even/odd
        |  | |  |
Even:  a0  a2  a4  a6      Odd:  a1  a3  a5  a7
        \       /                  \       /
     FFT(n/2)                   FFT(n/2)
        |                          |
    E0 E1 E2 E3              O0 O1 O2 O3
     \  |  |  /                \  |  |  /
      combine with twiddle factors w^k:
      y[k]     = E[k] + w^k * O[k]       (k = 0..n/2-1)
      y[k+n/2] = E[k] - w^k * O[k]       (k = 0..n/2-1)

Full butterfly diagram (n=8) -- all levels of the iterative FFT:

  FFT BUTTERFLY NETWORK (n=8, iterative, bottom-up)
  Inputs in bit-reversed order on the left; output = DFT values on the right.

  Level 0 (input)     Level 1 (2-pt DFTs)   Level 2 (4-pt DFTs)   Level 3 (8-pt DFT)
  bit-reversed idx    combine pairs          combine quads          combine all 8
  ==================  ====================  ====================  ==================
  a0 (000) ----*------+----- X0' ---*------+------ X0'' --*------+------ Y0
               |  w^0 |             |  w^0 |               |  w^0 |
  a4 (100) ----*------+----- X1' ---+------+------ X1'' ---+------+------ Y1
                                    |  w^1 |               |  w^1 |
  a2 (010) ----*------+----- X2' ---*------+------ X2'' ---+------+------ Y2
               |  w^0 |                                    |  w^2 |
  a6 (110) ----*------+----- X3' ---+------+------ X3'' ---+------+------ Y3
                                    |      |               |  w^3 |
  a1 (001) ----*------+----- X4' ---*------+------ X4'' --*------+------ Y4
               |  w^0 |             |  w^0 |               |      |
  a5 (101) ----*------+----- X5' ---+------+------ X5'' ---+------+------ Y5
                                    |  w^1 |               |      |
  a3 (011) ----*------+----- X6' ---*------+------ X6'' ---+------+------ Y6
               |  w^0 |                                    |      |
  a7 (111) ----*------+----- X7' ---+------+------ X7'' ---+------+------ Y7

  Each ---*---+--- is a butterfly:  top = u + w^k * v
                                    bot = u - w^k * v

  Level 1: stride=1, twiddle w^0 (all trivial +/- pairs)
  Level 2: stride=2, twiddles w^0, w^1  (w = w_4 = e^{2*pi*i/4} = i)
  Level 3: stride=4, twiddles w^0, w^1, w^2, w^3  (w = w_8 = e^{2*pi*i/8})

8th roots of unity on the complex plane:

                    Im
                    ^
            w^2     |     w^1
         (-0.7+0.7i)|  (0.7+0.7i)
              *     |     *
                \   |   /
    w^3          \  |  /         w^0
    (-1+0i) *-----\-+-/-------* (1+0i) --> Re
                  / | \
                /   |   \
              *     |     *
         (-0.7-0.7i)|  (0.7-0.7i)
            w^4     |     w^7
        (= -1)      |
              *     |     *
            w^5     |     w^6
         (-0.7-0.7i)|  (0.7-0.7i)

  w_8 = e^{2*pi*i/8},  w^k = (cos(2*pi*k/8), sin(2*pi*k/8))

  Key identity:  (w^k)^2 = w^{2k}  maps 8th roots to 4th roots
                 w^0=1   w^1=(1+i)/sqrt2   w^2=i   w^3=(-1+i)/sqrt2
                 w^4=-1  w^5=(-1-i)/sqrt2  w^6=-i  w^7=(1-i)/sqrt2

  The HALVING PROPERTY: squaring w_8^k gives w_4^k
    w_8^0 -> w_4^0,  w_8^1 -> w_4^1,  w_8^2 -> w_4^2,  w_8^3 -> w_4^3
    w_8^4 -> w_4^0,  w_8^5 -> w_4^1,  w_8^6 -> w_4^2,  w_8^7 -> w_4^3
  This is WHY the even/odd split works: both halves evaluate at SAME points!

The Convolution Theorem -- why FFT makes polynomial multiplication fast:

  TIME DOMAIN                              FREQUENCY DOMAIN
  (coefficient representation)             (point-value representation)
  +---------------------------+            +---------------------------+
  |  A = [a0, a1, ..., an-1]  |  --FFT-->  |  A^ = [A(w^0), ..., A(w^{n-1})] |
  |  B = [b0, b1, ..., bn-1]  |  --FFT-->  |  B^ = [B(w^0), ..., B(w^{n-1})] |
  +---------------------------+            +---------------------------+
            |                                         |
            | Naive multiply:                         | Pointwise multiply:
            | O(n^2)                                  | O(n)
            |                                         |
            v                                         v
  +---------------------------+            +---------------------------+
  |  C = A * B                |  <--IFFT-- |  C^ = A^ . B^            |
  |  (convolution, O(n^2))    |            |  C^[k] = A^[k] * B^[k]   |
  +---------------------------+            +---------------------------+

  Total cost:  FFT(A) + FFT(B) + pointwise + IFFT(C)
             = O(n log n) + O(n log n) + O(n) + O(n log n)
             = O(n log n)

  THE KEY: convolution in time domain = pointwise product in freq domain.
           FFT converts between the two in O(n log n).

Step 2: Pointwise multiply -- trivially O(n).

Step 3: Interpolate (inverse FFT) -- same algorithm, but use w1 instead of w, and divide all results by n.

The Invariant

"At each recursion level, the polynomial is split into even and odd coefficients, evaluated at roots of unity of half the order."

More precisely: at recursion depth d, we evaluate n/2d sub-polynomials, each of degree 2d1, at the 2d-th roots of unity. The butterfly recombination at each level is O(n) work, and there are log2n levels, giving O(nlogn) total.

Why n must be a power of 2: The halving property requires n to divide evenly at every level. We pad the polynomial to the next power of 2.

When to Reach for This

Trigger phrases in problem statements:

SignalTranslation
"polynomial multiplication"Direct FFT/NTT
"convolution" / "convolve"FFT/NTT
"count ways with sum constraints"Generating function + FFT
"n106" with combinatorial sumsProbably needs FFT
"number of pairs with ai+aj=k"Frequency array convolution
"string matching with wildcards"Convolution trick
"XOR/OR/AND convolution"Bitwise convolution variant
"answer modulo 998244353"NTT-friendly prime, strong hint

Quick decision:

  • Floating point OK, moderate n? Use FFT (doubles).
  • Need exact answers mod a prime? Use NTT.
  • 998244353=119223+1 has primitive root 3 and supports NTT up to n=2238×106.

Insight Gaps

Recognizing when a problem needs convolution. The hardest part of FFT problems is not implementing FFT -- it's seeing that the problem is a convolution in disguise. Triggers:

  • "Count pairs (i,j) with ai+aj=k for each k" --> self-convolution of frequency array
  • "Multiply two polynomials" --> direct application
  • "Count ways to roll n dice summing to k" --> multiply generating functions
  • "String matching with wildcards" --> cross-correlation via FFT
  • "XOR convolution" --> Walsh-Hadamard transform (different algorithm, same structure)

WHT for XOR/OR/AND convolutions. When the "index combination" operation is XOR/OR/AND instead of addition, standard FFT doesn't apply. The Walsh-Hadamard Transform (WHT) handles XOR convolution; sum-over-subsets (SOS/zeta transform) handles OR/AND. These are different transforms from FFT but follow the same "forward transform, pointwise operation, inverse transform" structure.

CONVOLUTION TYPE --> ALGORITHM DECISION:

  C[k] = sum_{i + j = k}  A[i] * B[j]    --> FFT / NTT
  C[k] = sum_{i XOR j = k} A[i] * B[j]   --> WHT (Walsh-Hadamard)
  C[k] = sum_{i OR  j = k} A[i] * B[j]   --> Zeta (sum-over-subsets)
  C[k] = sum_{i AND j = k} A[i] * B[j]   --> Zeta (superset-sum)

  All follow the same recipe:
    forward_transform(A), forward_transform(B)
    pointwise multiply
    inverse_transform(result)

  Only the transform function changes.

The bit-reversal permutation. The iterative FFT reorders input by bit-reversal of indices before the butterfly passes. Understanding why (the recursive splitting by even/odd indices produces a specific permutation, and bit-reversal is its inverse) is the difference between blindly using the implementation and being able to debug it.

What the Code Won't Teach You

The FFT code converts between coefficient representation and point-value representation. What it hides is why this is useful: pointwise multiplication is O(n), so the bottleneck is just the two transforms.

THE FFT RECIPE -- what the code actually does:

  Coefficient form            Point-value form
  ┌─────────────┐    FFT     ┌─────────────┐
  │ A = [a₀..aₙ]│──────────->│ A(ω⁰)..A(ωⁿ)│
  └─────────────┘            └──────┬──────┘
                                    │ pointwise
  ┌─────────────┐    FFT     ┌──────┴──────┐   multiply
  │ B = [b₀..bₙ]│──────────->│ B(ω⁰)..B(ωⁿ)│──-> C(ωᵏ)=A(ωᵏ)*B(ωᵏ)
  └─────────────┘            └─────────────┘

  ┌─────────────┐  inv FFT   ┌─────────────┐       │
  │ C = [c₀..cₙ]│<-──────────│ C(ω⁰)..C(ωⁿ)│<-──────┘
  └─────────────┘            └─────────────┘

  Cost: O(n log n) + O(n) + O(n log n) = O(n log n)
  vs. naive coefficient multiply: O(n²)

The code also won't teach you the evaluation-interpolation duality. FFT evaluates a polynomial at n roots of unity (O(nlogn)). Inverse FFT interpolates back from those values. This duality is why the same algorithm (with conjugated roots and a 1/n factor) works for both directions.

BIT-REVERSAL PERMUTATION for n=8:

  Index (decimal): 0  1  2  3  4  5  6  7
  Index (binary):  000 001 010 011 100 101 110 111
  Reversed:        000 100 010 110 001 101 011 111
  New position:    0  4  2  6  1  5  3  7

  Input array: [a₀ a₁ a₂ a₃ a₄ a₅ a₆ a₇]
  After perm:  [a₀ a₄ a₂ a₆ a₁ a₅ a₃ a₇]

  This reordering makes the butterfly stages work in-place:

  Stage 1: pairs (0,1)(2,3)(4,5)(6,7) -- stride 1
  Stage 2: pairs (0,2)(1,3)(4,6)(5,7) -- stride 2
  Stage 3: pairs (0,4)(1,5)(2,6)(3,7) -- stride 4

Finally, when precision fails (coefficients exceed ~1015), the code gives wrong answers silently -- no runtime error, just slightly off integers after llround. The NTT avoids this entirely by working in exact modular arithmetic, but only for NTT-friendly primes.

The code also won't teach you the deep connection between FFT and NTT: both are instances of the Discrete Fourier Transform over different algebraic structures. FFT works over C using complex roots of unity; NTT works over Z/pZ using primitive roots of unity modulo p. The butterfly structure is identical -- only the "twiddle factors" change.

WHY 998244353 IS THE CANONICAL NTT PRIME:

  998244353 = 119 x 2²³ + 1

  Key properties:
  ┌─────────────────────────────────────────────────┐
  │ 1. It's prime              -> modular inverse    │
  │ 2. p-1 = 119 x 2²³        -> 2²³ divides p-1    │
  │ 3. Max NTT size = 2²³     -> ~8 million elements │
  │ 4. Primitive root g = 3   -> verified, standard   │
  └─────────────────────────────────────────────────┘

  Compare with 10⁹+7 = 1000000007:
  ┌─────────────────────────────────────────────────┐
  │ 10⁹+7 - 1 = 2 x 500000003                      │
  │ Largest power of 2 dividing (p-1) = 2¹          │
  │ Max NTT size = 2  <- USELESS for NTT!            │
  └─────────────────────────────────────────────────┘

  When you need mod 10⁹+7: use split-coefficient FFT
  (3-4 real FFTs on 15-bit halves) or multi-prime NTT
  with CRT reconstruction.
BUTTERFLY DIAGRAM for n=8 FFT:

  a₀ ──┬──────────┬──────────┬──── X₀
  a₄ ──┤  ω⁰     ┤  ω⁰     ┤
       ├──────────┤          │
  a₂ ──┤          ├──────────┤──── X₁
  a₆ ──┤  ω⁰     │  ω¹     │
       ├──────────┤          │
  a₁ ──┤          ├──────────┤──── X₂
  a₅ ──┤  ω⁰     │  ω²     │
       ├──────────┤          │
  a₃ ──┤          ├──────────┤──── X₃
  a₇ ──┘  ω⁰     ┘  ω³     ┘

  Each stage: combine pairs with twiddle factor ωᵏ
  3 stages for n=8 (log₂8 = 3)

The connection between FFT and NTT: both are DFT, just over different rings. FFT computes the Discrete Fourier Transform over C using complex roots of unity e2πi/n. NTT computes the same DFT over Z/pZ using a primitive root of unity modulo p. The butterfly structure, the divide-and-conquer recurrence, and the bit-reversal permutation are identical -- only the "numbers" and their arithmetic change. This is why your FFT template converts to NTT by swapping complex<double> with long long and replacing ω=e2πi/n with g(p1)/nmodp.

The evaluation-interpolation duality is the soul of FFT. The FFT code performs two operations: evaluate a polynomial at n roots of unity (forward FFT) and recover coefficients from values at those points (inverse FFT). These are exact inverses of each other because the DFT matrix F and its inverse F1=1nF¯ are related by complex conjugation and scaling. Understanding this duality means you can derive the inverse FFT from the forward FFT without memorizing it separately -- just conjugate the roots and divide by n. It also explains why the same butterfly structure works in both directions.

Why 998244353 is the competitive programmer's favorite NTT prime.998244353=119×223+1. The factor 223 means this prime supports NTT of size up to 223=8,388,608 -- large enough for almost any contest problem. A primitive root is g=3. Compare with 109+7=500000003×2+1: the factor of 2 is only 21, so it only supports NTT of size 2 -- useless in practice.

FFT vs NTT -- SAME ALGORITHM, DIFFERENT RINGS:

  ┌──────────────────────┬───────────────────────────────┐
  │        FFT            │            NTT                │
  ├──────────────────────┼───────────────────────────────┤
  │ Ring: ℂ (complex)     │ Ring: ℤ/pℤ (mod prime p)     │
  │ Root: ω = e^(2πi/n)  │ Root: ω = g^((p-1)/n) mod p  │
  │ Inverse: conjugate    │ Inverse: modular inverse      │
  │ Scaling: ÷ n          │ Scaling: x n⁻¹ mod p         │
  │ Precision: <= 10¹⁵     │ Precision: exact (mod p)     │
  ├──────────────────────┴───────────────────────────────┤
  │  Butterfly structure, bit-reversal, O(n log n)       │
  │  -- ALL identical between FFT and NTT                 │
  └──────────────────────────────────────────────────────┘

  WHY 998244353?

  998244353  =  119 x 2²³ + 1
                      ^^^
                      max NTT size = 2²³ = 8,388,608
                      primitive root g = 3

  Compare:
  10⁹ + 7   =  500000003 x 2¹ + 1   ->  max NTT size = 2  (useless!)
  985661441  =  235 x 2²²  + 1       ->  max NTT size = 2²² = 4,194,304
  469762049  =  7   x 2²⁶  + 1       ->  max NTT size = 2²⁶ (even bigger!)

How It Works (Detailed Mechanics)

Roots of Unity Primer

The n-th roots of unity are the n complex numbers satisfying zn=1: wnk=e2πik/n=cos(2πk/n)+isin(2πk/n)

Key properties:

  1. Periodicity: wnk+n=wnk
  2. Halving: wn2k=wn/2k (the crucial one for FFT)
  3. Cancellation: wnn/2=1, so wnk+n/2=wnk
  4. Summation: k=0n1wnjk=0 for j0(modn)

Property 3 gives us the "butterfly": we compute E[k]+wkO[k] and E[k]wkO[k] simultaneously.

Cooley-Tukey (Recursive FFT)

FFT(a[0..n-1], w):
  if n == 1: return a
  even = FFT(a[0], a[2], ..., a[n-2], w^2)
  odd  = FFT(a[1], a[3], ..., a[n-1], w^2)
  t = 1
  for k = 0 to n/2 - 1:
    a[k]       = even[k] + t * odd[k]
    a[k + n/2] = even[k] - t * odd[k]
    t = t * w
  return a

Iterative FFT (Bit-Reversal)

The recursive version has overhead from memory allocation. The iterative version works in-place by first permuting the array according to bit-reversal of indices, then building up the butterfly from small to large.

Bit-reversal example (n=8, 3 bits):

Index:     0    1    2    3    4    5    6    7
Binary:  000  001  010  011  100  101  110  111
Reverse: 000  100  010  110  001  101  011  111
Decimal:   0    4    2    6    1    5    3    7

After bit-reversal, the iterative butterflies naturally combine adjacent pairs, then groups of 4, then 8, etc.

FFT vs NTT -- SAME BUTTERFLY, DIFFERENT ARITHMETIC:

  FFT butterfly (complex):            NTT butterfly (modular):
  
  u = a[j]                            u = a[j]
  v = a[j + half] * w                 v = a[j + half] * w % MOD
  
  a[j]        = u + v                 a[j]        = (u + v) % MOD
  a[j + half] = u - v                 a[j + half] = (u - v + MOD) % MOD
  
  w *= wn                             w = w * wn % MOD
  
  ┌─────────────────┬──────────────────────────┐
  │     FFT          │          NTT             │
  ├─────────────────┼──────────────────────────┤
  │ w = e^(2πi/n)   │ w = g^((p-1)/n) mod p   │
  │ complex mul     │ modular mul              │
  │ ÷ n at end      │ x n⁻¹ mod p at end      │
  │ precision ~10¹⁵ │ exact (mod p)            │
  │ uses sin/cos    │ uses modpow              │
  └─────────────────┴──────────────────────────┘

NTT: Modular Arithmetic Version

Replace complex roots of unity with primitive roots modulo a prime.

For prime p with p1=c2k, we can do NTT of size up to 2k. The primitive n-th root of unity mod p is g(p1)/n where g is a primitive root of p.

Common NTT-friendly primes:

Prime            p-1 factorization       Max NTT size    Prim. root
998244353        119 * 2^23              2^23 = 8388608       3
985661441        235 * 2^22              2^22 = 4194304       3
469762049        7   * 2^26             2^26 = 67108864       3
167772161        5   * 2^25             2^25 = 33554432       3

Inverse NTT: Same as NTT but with g1 instead of g, and multiply all results by n1(modp).


Implementation

Recursive FFT (Educational)

cpp
#include <bits/stdc++.h>
using namespace std;

using cd = complex<double>;
const double PI = acos(-1.0);

void fft_recursive(vector<cd>& a, bool invert) {
    int n = a.size();
    if (n == 1) return;

    vector<cd> even(n / 2), odd(n / 2);
    for (int i = 0; i < n / 2; i++) {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    fft_recursive(even, invert);
    fft_recursive(odd, invert);

    double ang = 2 * PI / n * (invert ? -1 : 1);
    cd w(1), wn(cos(ang), sin(ang));
    for (int i = 0; i < n / 2; i++) {
        a[i]         = even[i] + w * odd[i];
        a[i + n / 2] = even[i] - w * odd[i];
        w *= wn;
    }
}

Iterative FFT (Contest-Ready, Doubles)

cpp
#include <bits/stdc++.h>
using namespace std;

using cd = complex<double>;
const double PI = acos(-1.0);

void fft(vector<cd>& a, bool invert) {
    int n = a.size();
    // Bit-reversal permutation
    for (int i = 1, j = 0; i < n; i++) {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) swap(a[i], a[j]);
    }
    // Butterfly computation
    for (int len = 2; len <= n; len <<= 1) {
        double ang = 2 * PI / len * (invert ? -1 : 1);
        cd wlen(cos(ang), sin(ang));
        for (int i = 0; i < n; i += len) {
            cd w(1);
            for (int j = 0; j < len / 2; j++) {
                cd u = a[i + j];
                cd v = a[i + j + len / 2] * w;
                a[i + j]           = u + v;
                a[i + j + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
    if (invert) {
        for (auto& x : a) x /= n;
    }
}

// Multiply two polynomials given as coefficient vectors.
// Returns coefficient vector of the product.
vector<long long> multiply(vector<int>& a, vector<int>& b) {
    vector<cd> fa(a.begin(), a.end()), fb(b.begin(), b.end());
    int n = 1;
    while (n < (int)(a.size() + b.size() - 1)) n <<= 1;
    fa.resize(n);
    fb.resize(n);

    fft(fa, false);
    fft(fb, false);
    for (int i = 0; i < n; i++) fa[i] *= fb[i];
    fft(fa, true);

    vector<long long> res(n);
    for (int i = 0; i < n; i++) res[i] = llround(fa[i].real());
    return res;
}

Precision note: With double, reliable for coefficients up to about 1014. For larger values or exact modular results, use NTT.

NTT (Contest-Ready, mod 998244353)

cpp
#include <bits/stdc++.h>
using namespace std;

const int MOD = 998244353;
const int PRIM_ROOT = 3;

long long power(long long a, long long b, long long mod) {
    long long res = 1;
    a %= mod;
    while (b > 0) {
        if (b & 1) res = res * a % mod;
        a = a * a % mod;
        b >>= 1;
    }
    return res;
}

long long inv(long long a, long long mod) {
    return power(a, mod - 2, mod);
}

void ntt(vector<long long>& a, bool invert) {
    int n = a.size();
    // Bit-reversal permutation
    for (int i = 1, j = 0; i < n; i++) {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) swap(a[i], a[j]);
    }
    // Butterfly computation
    for (int len = 2; len <= n; len <<= 1) {
        long long w = invert ? inv(PRIM_ROOT, MOD) : PRIM_ROOT;
        w = power(w, (MOD - 1) / len, MOD);
        for (int i = 0; i < n; i += len) {
            long long wn = 1;
            for (int j = 0; j < len / 2; j++) {
                long long u = a[i + j];
                long long v = a[i + j + len / 2] * wn % MOD;
                a[i + j]           = (u + v) % MOD;
                a[i + j + len / 2] = (u - v + MOD) % MOD;
                wn = wn * w % MOD;
            }
        }
    }
    if (invert) {
        long long n_inv = inv(n, MOD);
        for (auto& x : a) x = x * n_inv % MOD;
    }
}

// Multiply two polynomials modulo MOD.
vector<long long> multiply_ntt(vector<long long>& a, vector<long long>& b) {
    vector<long long> fa(a), fb(b);
    int n = 1;
    while (n < (int)(a.size() + b.size() - 1)) n <<= 1;
    fa.resize(n);
    fb.resize(n);

    ntt(fa, false);
    ntt(fb, false);
    for (int i = 0; i < n; i++) fa[i] = fa[i] * fb[i] % MOD;
    ntt(fa, true);

    fa.resize(a.size() + b.size() - 1);
    return fa;
}

Convolution Wrapper (Frequency Array)

A very common pattern: given arrays of counts, compute the "sum convolution."

cpp
// Given freq[v] = number of elements with value v,
// compute conv[s] = number of pairs (i, j) with a[i] + a[j] = s.
vector<long long> sum_convolution(vector<int>& freq) {
    // Self-convolution via FFT
    vector<cd> fa(freq.begin(), freq.end());
    int n = 1;
    while (n < (int)(2 * freq.size() - 1)) n <<= 1;
    fa.resize(n);
    fft(fa, false);
    for (int i = 0; i < n; i++) fa[i] *= fa[i];
    fft(fa, true);
    vector<long long> res(n);
    for (int i = 0; i < n; i++) res[i] = llround(fa[i].real());
    return res;
}

Arbitrary Modulus Convolution (Split Trick)

When the modulus is not NTT-friendly, split coefficients into high and low 15-bit halves and use 3 FFTs:

cpp
// Multiply polynomials mod any prime using FFT with doubles.
// Split each coefficient a[i] = a_hi[i] * BASE + a_lo[i].
// Then a*b = (a_hi*b_hi)*BASE^2 + (a_hi*b_lo + a_lo*b_hi)*BASE + a_lo*b_lo
// Each sub-product is computed via FFT on doubles (fits in precision).
vector<long long> multiply_mod(vector<int>& a, vector<int>& b, long long mod) {
    const int BASE = 1 << 15; // 32768
    int need = a.size() + b.size() - 1;
    int n = 1;
    while (n < need) n <<= 1;

    vector<cd> a_lo(n), a_hi(n), b_lo(n), b_hi(n);
    for (int i = 0; i < (int)a.size(); i++) {
        a_lo[i] = a[i] % BASE;
        a_hi[i] = a[i] / BASE;
    }
    for (int i = 0; i < (int)b.size(); i++) {
        b_lo[i] = b[i] % BASE;
        b_hi[i] = b[i] / BASE;
    }

    fft(a_lo, false); fft(a_hi, false);
    fft(b_lo, false); fft(b_hi, false);

    // Combine: 4 pointwise products
    vector<cd> hh(n), hl(n), lh(n), ll(n);
    for (int i = 0; i < n; i++) {
        hh[i] = a_hi[i] * b_hi[i];
        hl[i] = a_hi[i] * b_lo[i];
        lh[i] = a_lo[i] * b_hi[i];
        ll[i] = a_lo[i] * b_lo[i];
    }

    fft(hh, true); fft(hl, true); fft(lh, true); fft(ll, true);

    vector<long long> res(need);
    for (int i = 0; i < need; i++) {
        long long v_hh = llround(hh[i].real()) % mod;
        long long v_hl = llround(hl[i].real()) % mod;
        long long v_lh = llround(lh[i].real()) % mod;
        long long v_ll = llround(ll[i].real()) % mod;
        res[i] = ((v_hh % mod * BASE % mod * BASE % mod)
                + (v_hl + v_lh) % mod * BASE % mod
                + v_ll) % mod;
        res[i] = (res[i] % mod + mod) % mod;
    }
    return res;
}

Patterns & Applications

Pattern 1: Direct Polynomial Multiplication

When: The problem literally asks you to multiply two polynomials or compute a convolution of two sequences.

Given two polynomials A and B, compute C = A * B.

Solution: Direct FFT/NTT multiply.
Time: O(n log n)

Pattern 2: Counting with Generating Functions

When: "Count the number of ways to pick items whose values sum to S."

Idea: Create a generating function where the coefficient of xk represents the number of ways to achieve sum k. Multiplying generating functions corresponds to combining independent choices.

Example: You have n dice, each showing values 1 to 6. Count the number of ways to get each possible total.

cpp
// Each die contributes: x^1 + x^2 + x^3 + x^4 + x^5 + x^6
// For n dice, multiply this polynomial with itself n times.
// Use repeated squaring: O(n * S * log n) via FFT, where S = 6n.
vector<long long> dice_ways(int n) {
    // Single die polynomial: coefficients for x^1..x^6
    vector<long long> single = {0, 1, 1, 1, 1, 1, 1};
    vector<long long> result = {1}; // identity: x^0

    // Binary exponentiation on polynomials
    auto poly = single;
    int exp = n;
    while (exp > 0) {
        if (exp & 1) result = multiply_ntt(result, poly);
        poly = multiply_ntt(poly, poly);
        exp >>= 1;
    }
    return result; // result[s] = number of ways to get sum s
}
CONVOLUTION TYPE -> ALGORITHM DECISION TREE:

  What operation combines indices?

  ├─ i + j = k  (additive)
  │  ├─ Need exact mod p?
  │  │  ├─ p is NTT-friendly (p = c*2ᵏ+1, k large) -> NTT mod p
  │  │  ├─ p = 10⁹+7 (not NTT-friendly) -> split-coefficient FFT (7 FFTs)
  │  │  └─ arbitrary p -> CRT with 2-3 NTT-friendly primes
  │  └─ No modulus / small values -> FFT on doubles + llround

  ├─ i OR j = k  -> SOS (zeta/Möbius) transform
  ├─ i AND j = k -> superset-sum transform
  ├─ i XOR j = k -> Walsh-Hadamard Transform (WHT)

  └─ i * j = k (multiplicative) -> Dirichlet convolution
     (no fast transform in general; use factorization tricks)
CHOOSING YOUR FFT/NTT VARIANT -- QUICK REFERENCE:

  ┌─────────────────────────────────────────────────────────────┐
  │                   START HERE                                │
  │                      │                                      │
  │          Need modular arithmetic?                           │
  │           ╱              ╲                                   │
  │         YES               NO                                │
  │          │                 │                                 │
  │    What modulus?      Values small                           │
  │     ╱    │     ╲      enough for                             │
  │   NTT-  10⁹+7  Other doubles?                               │
  │  friendly        mod    ╱    ╲                               │
  │    │       │      │   YES     NO                             │
  │    ↓       ↓      ↓    ↓      ↓                             │
  │   NTT   Split   CRT  FFT   Multi-                           │
  │   mod p  coeff  with  with  prime                            │
  │         FFT    2-3    llround NTT                            │
  │         (7     NTT          + CRT                            │
  │         FFTs)  primes                                        │
  └─────────────────────────────────────────────────────────────┘

  Decision factors:
  ┌──────────────────┬────────────┬──────────────────────┐
  │ Variant          │ Speed      │ When to use          │
  ├──────────────────┼────────────┼──────────────────────┤
  │ NTT mod p        │ Fastest    │ p is NTT-friendly    │
  │ FFT doubles      │ Fast       │ n*v² < 10¹⁵         │
  │ Split-coeff FFT  │ ~7x FFT   │ mod 10⁹+7           │
  │ Multi-prime NTT  │ ~3x NTT   │ arbitrary modulus    │
  └──────────────────┴────────────┴──────────────────────┘

Pattern 3: Frequency Convolution (Pair Counting)

When: "How many pairs (i,j) satisfy ai+aj=k for each k?"

Build a frequency array freq[v] and self-convolve. The result at index k counts ordered pairs with sum k.

cpp
// freq[v] = count of elements equal to v
// After convolution: res[k] = sum over all v of freq[v] * freq[k-v]
//                            = number of ordered pairs summing to k

Variations:

  • Difference instead of sum: reverse one array, then convolve.
  • Product: take logarithms to convert to sums (with care for precision).
FREQUENCY SELF-CONVOLUTION -- counting pair sums:

  Array: [1, 3, 2, 3, 5]
  
  Frequency:  freq[1]=1, freq[2]=1, freq[3]=2, freq[5]=1
  
  Convolution C = freq ⊛ freq:
  
  C[k] = Σ freq[i] * freq[k-i]
  
  C[2] = freq[1]*freq[1] = 1     (pair: 1+1)
  C[3] = freq[1]*freq[2] + freq[2]*freq[1] = 2  (pairs: 1+2, 2+1)
  C[4] = freq[1]*freq[3] + freq[2]*freq[2] + freq[3]*freq[1] = 5
  C[6] = freq[1]*freq[5] + freq[3]*freq[3] + freq[5]*freq[1] = 6
  
  Value:  0   1   2   3   4   5   6   7   8  ...
  freq:   0   1   1   2   0   1   0   0   0
  conv:   0   0   1   2   5   4   6   4   1  ...
              ^               ^
              |               |
         1+1=2 (1 way)  1+3=4, 2+2=4, 3+1=4 (5 ordered ways)

Pattern 4: String Matching via Convolution

When: "Find all positions where pattern P matches text T, allowing wildcards."

Encode characters as numbers. A match at position s means j=0m1(T[s+j]P[j])2wT[s+j]wP[j]=0 where w is 0 at wildcard positions, 1 otherwise.

Expand the squared term and compute each part via convolution (reversing P).

cpp
// Simplified: wildcard = 0 in both T and P
// Match score at position s:
//   sum_j T[s+j]^2 * P[j] - 2 * T[s+j] * P[j]^2 + P[j]^3 ... (expanded)
// Each sum is a convolution after reversing P.

Pattern 5: Bitwise Convolution (OR/AND/XOR)

Standard convolution computes C[k]=i+j=kA[i]B[j].

OR convolution: C[k]=i|j=kA[i]B[j] Uses the sum-over-subsets (SOS) / Mobius transform instead of FFT.

XOR convolution: C[k]=ij=kA[i]B[j] Uses the Walsh-Hadamard Transform (WHT).

cpp
// Walsh-Hadamard Transform (XOR convolution)
void wht(vector<long long>& a, bool invert) {
    int n = a.size();
    for (int len = 1; len < n; len <<= 1) {
        for (int i = 0; i < n; i += len << 1) {
            for (int j = 0; j < len; j++) {
                long long u = a[i + j];
                long long v = a[i + j + len];
                a[i + j]       = (u + v) % MOD;
                a[i + j + len] = (u - v + MOD) % MOD;
            }
        }
    }
    if (invert) {
        long long inv_n = inv(n, MOD);
        for (auto& x : a) x = x * inv_n % MOD;
    }
}

// OR convolution (subset-sum / zeta transform)
void or_transform(vector<long long>& a, bool invert) {
    int n = a.size();
    for (int len = 1; len < n; len <<= 1) {
        for (int i = 0; i < n; i += len << 1) {
            for (int j = 0; j < len; j++) {
                if (!invert)
                    a[i + j + len] = (a[i + j + len] + a[i + j]) % MOD;
                else
                    a[i + j + len] = (a[i + j + len] - a[i + j] + MOD) % MOD;
            }
        }
    }
}

// AND convolution (superset-sum / zeta transform)
void and_transform(vector<long long>& a, bool invert) {
    int n = a.size();
    for (int len = 1; len < n; len <<= 1) {
        for (int i = 0; i < n; i += len << 1) {
            for (int j = 0; j < len; j++) {
                if (!invert)
                    a[i + j] = (a[i + j] + a[i + j + len]) % MOD;
                else
                    a[i + j] = (a[i + j] - a[i + j + len] + MOD) % MOD;
            }
        }
    }
}

// Usage: XOR convolution of two arrays
vector<long long> xor_convolve(vector<long long> a, vector<long long> b) {
    int n = 1;
    while (n < max(a.size(), b.size())) n <<= 1;
    a.resize(n); b.resize(n);
    wht(a, false); wht(b, false);
    for (int i = 0; i < n; i++) a[i] = a[i] * b[i] % MOD;
    wht(a, true);
    return a;
}

Pattern 6: Power of a Polynomial / Repeated Squaring

When: You need P(x)k for large k.

Combine polynomial multiplication with binary exponentiation. Cost: O(nlognlogk) where n is the degree of the result.

cpp
vector<long long> poly_power(vector<long long> base, int k, int max_deg) {
    vector<long long> result = {1};
    while (k > 0) {
        if (k & 1) {
            result = multiply_ntt(result, base);
            if ((int)result.size() > max_deg + 1)
                result.resize(max_deg + 1);
        }
        base = multiply_ntt(base, base);
        if ((int)base.size() > max_deg + 1)
            base.resize(max_deg + 1);
        k >>= 1;
    }
    return result;
}

Complexity Analysis

OperationTimeSpace
FFT (forward or inverse)O(nlogn)O(n)
NTT (forward or inverse)O(nlogn)O(n)
Polynomial multiplyO(nlogn)O(n)
WHT / OR / AND transformO(nlogn)O(n)
Arbitrary-mod multiply (split)O(nlogn)O(n)
Poly power via binary expO(nlognlogk)O(n)

Constant factors:

  • FFT (doubles): ~4-5x slower than NTT due to floating point and sin/cos.
  • NTT (mod 998244353): Fastest for modular problems.
  • The split trick for arbitrary modulus uses ~7 FFTs total.

When n is the degree bound: Pad to $n' = $ next power of 2 2n. For two degree-d polynomials, the product has degree 2d, so pad to n2d+1.

POLYNOMIAL SIZE -> FFT SIZE -> RESULT SIZE:

  Input:  A has |A| coefficients,  B has |B| coefficients
  Result: C = A*B has |A|+|B|-1 coefficients
  FFT needs: n_fft = next power of 2 >= |A|+|B|-1

  Example:  |A|=3,  |B|=4

  A:  [a₀  a₁  a₂]           3 coefficients
  B:  [b₀  b₁  b₂  b₃]      4 coefficients
  C:  [c₀  c₁  c₂  c₃  c₄  c₅]   = 3+4-1 = 6 coefficients

  FFT size = next_pow2(6) = 8

  Padded A:  [a₀  a₁  a₂  0   0   0   0   0 ]   (8 entries)
  Padded B:  [b₀  b₁  b₂  b₃  0   0   0   0 ]   (8 entries)
  FFT out:   [c₀  c₁  c₂  c₃  c₄  c₅  0   0 ]   (8 entries)
  Result:    [c₀  c₁  c₂  c₃  c₄  c₅]            (take first 6)

  Work done: 3 FFTs of size 8 = 3 x 8 x log₂(8) = 72 operations
  If we padded to 16 instead:  3 x 16 x 4 = 192 operations  (2.67x slower!)
ROOTS OF UNITY ON THE UNIT CIRCLE (n=8):

              ω¹ = e^(2πi/8)
               *
          ω²  ╱     ╲  ω⁰ = 1
            *         *
           ╱           ╲
     ω³  *───────────────*  ω⁷     <- the 8 points where
           ╲           ╱             we evaluate the polynomial
            *         *
          ω⁴  ╲     ╱  ω⁶
               *
              ω⁵

  Key properties:
  • ω⁸ = 1  (they're 8th roots of unity)
  • ωᵏ⁺⁴ = -ωᵏ  (half-circle symmetry -> butterfly trick)
  • Sum of all roots = 0  (cancellation lemma)
  • In NTT: replace unit circle with cyclic group mod p
    ω = g^((p-1)/n) mod p,  where g is a primitive root
RUNTIME SCALING -- WHY CONSTANTS MATTER IN O(n log n):

  FFT size (n)   log₂n   n*log₂n     Relative to 2¹⁷
  ──────────────────────────────────────────────────────
   2¹⁵ =  32K     15       491K       0.14x
   2¹⁶ =  64K     16      1049K       0.30x
   2¹⁷ = 128K     17      2228K       1.00x  <- baseline
   2¹⁸ = 256K     18      4718K       2.12x
   2¹⁹ = 512K     19      9961K       4.47x
   2²⁰ =   1M     20     20972K       9.41x
   2²¹ =   2M     21     44040K      19.77x
   2²² =   4M     22     92274K      41.42x
   2²³ =   8M     23    192937K      86.60x  <- 998244353 max

  Key takeaway: going from 2¹⁷ -> 2²⁰ is ~9x slower,
  not just "a little more padding." Every extra bit in
  the exponent roughly doubles the work.

  Each FFT multiply does 3 transforms (FFT, FFT, IFFT),
  so total work = 3 x n x log₂n butterfly operations.
MEMORY LAYOUT -- CACHE EFFECTS IN FFT:

  Small n (fits in L1/L2 cache):
  ┌──────────────────────────────────────┐
  │ All butterfly operations hit cache   │
  │ -> near-peak throughput               │
  └──────────────────────────────────────┘

  Large n (exceeds cache):
  ┌──────────────────────────────────────┐
  │ Later stages stride > cache line     │
  │ -> cache misses on every access       │
  │ -> 3-5x slower per operation          │
  └──────────────────────────────────────┘

  Practical consequence:
    n = 2¹⁸ (256K complex doubles = 4 MB) ~= L3 boundary
    n = 2²⁰ (1M complex doubles = 16 MB)  -> cache-hostile
    -> NTT (mod arithmetic, no doubles) uses half the memory

Edge Cases & Gotchas

  1. Forgetting to pad to power of 2: The FFT/NTT size must be a power of 2 and at least |A|+|B|1.

  2. Precision with doubles: llround instead of (long long) to avoid off-by-one rounding errors. Coefficients above 1015 may lose precision.

  3. Modular inverse existence: For NTT, n (the FFT size) must be invertible mod p. Since p is prime and n is a power of 2 that divides p1, this is always satisfied for NTT-friendly primes.

  4. Off-by-one in result size: Product of polynomials of sizes a and b has size a+b1, not a+b.

  5. NTT overflow: Intermediate products a[i]wn can overflow long long if both are close to 1018. Use __int128 or ensure values are reduced mod p before multiplication (they should be if done correctly).

  6. Negative coefficients in NTT: Add p before taking mod to avoid negative values: (u - v + MOD) % MOD.

  7. Self-convolution double counting: When convolving a frequency array with itself to count pairs, C[k] counts ordered pairs. For unordered pairs, handle i=j and ij separately.

  8. Reusing arrays between test cases: Clear or resize properly. A common bug is leftover data from a previous FFT call.

Mental Traps

"FFT is just an efficient algorithm. I don't need to understand the math." You can use it as a black box -- until you need to modify it. Handling non-power-of-2 sizes, choosing between FFT and NTT, understanding why precision fails at large coefficient values, recognizing that the XOR convolution needs WHT not FFT -- all of these require knowing what the algorithm actually does. The "just copy-paste the template" approach breaks at the first non-standard application.

"If I'm working modulo some prime, use NTT." Only if the prime is NTT-friendly (p=c2k+1 with large enough k). If the modulus is 109+7 (which is not NTT-friendly), you need either the split-coefficient FFT trick or a different approach. Reaching for NTT when the modulus doesn't support it gives wrong answers, not errors.

"Convolution and polynomial multiplication are different things." They're not. The convolution of arrays a and b at index k is i+j=ka[i]b[j]. This is exactly the coefficient of xk in the product polynomial A(x)B(x). Recognizing "convolution" problems as "polynomial multiplication" problems (and vice versa) doubles the situations where you know FFT/NTT applies.

"I need exact integer arithmetic, so doubles won't work." With llround, FFT on doubles gives exact integer results as long as the intermediate values don't exceed 1015. For two arrays of length n with values up to v, the maximum coefficient in the product is nv2. If nv2<1015, doubles are fine. Otherwise, use NTT.

"I'll just make the FFT size much bigger to be safe." Doubling n adds a full extra butterfly stage -- O(nlogn) means the constant matters. Padding to 220 when 217 suffices costs ~3.5x more work. Worse, if n exceeds the NTT prime's max supported transform size (223 for 998244353), the transform silently produces garbage because the required root of unity doesn't exist.

"The result polynomial has the same size as my input arrays." The product of polynomials of sizes |A| and |B| has exactly |A|+|B|1 coefficients. Reading only max(|A|,|B|) entries misses high-degree terms; reading |A|+|B| includes one garbage trailing zero.

POLYNOMIAL SIZE ARITHMETIC:

  A = [a₀, a₁, a₂]         |A| = 3  (degree 2)
  B = [b₀, b₁, b₂, b₃]    |B| = 4  (degree 3)

  Product C = A x B:
    degree(C) = 2 + 3 = 5
    |C| = 3 + 4 - 1 = 6    <- THIS is how many coefficients

  FFT size needed:
    next_power_of_2(6) = 8

  Common mistakes:
    X Reading |C| = max(3,4) = 4   -> misses c₄, c₅
    X Reading |C| = 3 + 4 = 7      -> c₆ is garbage (always 0, but wasteful)
    X FFT size = max(3,4) = 4      -> aliasing! Results wrap around
    OK Reading |C| = 3 + 4 - 1 = 6  -> correct
    OK FFT size = 8                  -> correct (next power of 2 >= 6)

"I'll just increase the FFT size to be safe." Doubling the FFT size doubles the work (O(nlogn) is not O(n)). Going from 217 to 218 adds ~260K extra operations -- noticeable in tight time limits. Worse, if you're using NTT, exceeding the prime's supported transform size silently produces garbage. For 998244353=119×223+1, the maximum NTT size is 223=8,388,608. Padding to 224 doesn't just waste time -- it gives wrong answers because the required primitive root of order 224 doesn't exist mod this prime.

"The result array has the same size as the input." The product of a degree-(a1) and a degree-(b1) polynomial has degree a+b2, so |C|=|A|+|B|1 coefficients. Reading only max(|A|,|B|) entries silently drops high-order terms. Reading all nfft entries includes trailing zeros (or floating-point noise in FFT). Always resize the result to exactly |A|+|B|1.

SIZE RELATIONSHIPS IN POLYNOMIAL MULTIPLICATION:

  A has |A| coefficients (degree |A|-1)
  B has |B| coefficients (degree |B|-1)
  C = A*B has |A|+|B|-1 coefficients

  Example: A = [1, 2, 3]  (|A|=3)
           B = [4, 5, 6, 7]  (|B|=4)

  Product size = 3 + 4 - 1 = 6 coefficients
  FFT size     = next power of 2 >= 6 = 8

  ┌───┬───┬───┬───┬───┬───┬───┬───┐
  │ c₀│ c₁│ c₂│ c₃│ c₄│ c₅│ 0 │ 0 │  <- FFT output (size 8)
  └───┴───┴───┴───┴───┴───┴───┴───┘
  ├─── result: take first 6 ───┤ ├junk┤

  Common bugs:
  X result.resize(max(|A|,|B|)) = 4  -> loses c₄, c₅
  X result.resize(n_fft) = 8         -> includes garbage zeros
  OK result.resize(|A|+|B|-1) = 6     -> correct

Trap: "Self-convolution counts each pair once."

Self-convolution of a frequency array counts ordered pairs. If freq = [0, 2, 1] (two 1s, one 2), then conv[2] = freq[1]*freq[1] = 4, which counts pairs (1,1), (1,1), (1,1), (1,1) -- all four ordered selections of two elements from {1, 1}. For unordered pairs, you need to subtract the i=j diagonal and divide by 2. Forgetting this adjustment is a classic source of WA.

SELF-CONVOLUTION PAIR COUNTING:

  Array: [1, 1, 2]    freq[1]=2, freq[2]=1

  Raw convolution result for sum k:
    conv[2] = freq[0]*freq[2] + freq[1]*freq[1] + freq[2]*freq[0]
            = 0*1 + 2*2 + 1*0 = 4    <- ORDERED pairs summing to 2

  Ordered pairs summing to 2: (1,1) x 4 ways
    (elem 0 + elem 1), (elem 1 + elem 0),
    (elem 0 + elem 0), (elem 1 + elem 1)  <- includes i=j!

  Adjustments for UNORDERED DISTINCT pairs:
    1. Subtract i=j: conv[2k] -= freq[k]  (for each k)
    2. Divide by 2:  conv[s] /= 2

  ┌──────────────────────────────────────────────┐
  │ Raw conv = ordered pairs (including i=j)     │
  │ Subtract diagonal -> ordered pairs (i!=j)      │
  │ Divide by 2 -> unordered pairs (i!=j)          │
  └──────────────────────────────────────────────┘

Trap: "I can do NTT with any prime, as long as I have a primitive root."

You need a primitive root of order exactly n (your FFT size), meaning n(p1). Having a primitive root modulo p (order p1) isn't enough -- you need p1 to be divisible by n. For p=109+7, since p1=2×500000003, the only power-of-2 NTT size supported is n=2. That's why 998244353 (=119×223+1) is the standard choice.

NTT PRIME COMPATIBILITY CHECK:

  For NTT of size n (must be power of 2):
    REQUIREMENT: n divides (p - 1)

  p = 998244353:  p-1 = 998244352 = 119 x 2²³
    Max n = 2²³ = 8,388,608  OK (plenty for contests)

  p = 10⁹ + 7:   p-1 = 10⁹ + 6 = 2 x 500000003
    Max n = 2¹ = 2            X (useless!)

  p = 469762049:  p-1 = 469762048 = 7 x 2²⁶
    Max n = 2²⁶ = 67,108,864  OK (even bigger!)

  ┌───────────────────────────────────────────────┐
  │ Before using NTT, ALWAYS verify:              │
  │   1. p is prime                               │
  │   2. Find largest power of 2 dividing p-1     │
  │   3. That power >= your required FFT size      │
  │   4. Know the primitive root (usually g=3)    │
  └───────────────────────────────────────────────┘

Debugging Checklist

When your FFT/NTT solution gives wrong answers:

[ ] Is n (FFT size) a power of 2 and >= sum of input sizes?
[ ] Did you resize BOTH input arrays to n before FFT?
[ ] Is the result array read up to the correct index (a.size() + b.size() - 2)?
[ ] For FFT: are you using llround(), not (long long) cast?
[ ] For NTT: is the modulus NTT-friendly? Does 2^k divide (p-1)?
[ ] For NTT: is the primitive root correct for your modulus?
[ ] Is the inverse FFT dividing by n (or multiplying by n^{-1} mod p)?
[ ] For convolution: did you reverse the correct array?
[ ] Are intermediate multiplications overflowing?
[ ] For multiple test cases: are you clearing/resizing properly?

Quick sanity test: Multiply [1,1]×[1,1]. Expected: [1,2,1] (since (1+x)(1+x)=1+2x+x2).

cpp
// Debug helper: test basic multiplication
void test_fft() {
    vector<int> a = {1, 1}, b = {1, 1};
    auto c = multiply(a, b);
    // c should be [1, 2, 1]
    assert(c[0] == 1 && c[1] == 2 && c[2] == 1);
}

Worked Examples

Example 0: Step-by-Step FFT Multiply -- (1+2x)×(3+x)

Goal: Compute (1+2x)(3+x)=3+7x+2x2 using the FFT pipeline.

A=[1,2] (coefficients of 1+2x), B=[3,1] (coefficients of 3+x). Product has 2+21=3 coefficients, so pad to n=4 (next power of 2).

STEP 0: PAD TO n=4
  A = [1, 2, 0, 0]     B = [3, 1, 0, 0]

STEP 1: EVALUATE (forward FFT at 4th roots of unity)
  The 4th roots of unity: w_4^0=1, w_4^1=i, w_4^2=-1, w_4^3=-i

  A(w^0) = 1 + 2(1)  + 0 + 0 = 3
  A(w^1) = 1 + 2(i)  + 0 + 0 = 1 + 2i
  A(w^2) = 1 + 2(-1) + 0 + 0 = -1
  A(w^3) = 1 + 2(-i) + 0 + 0 = 1 - 2i

  B(w^0) = 3 + 1(1)  + 0 + 0 = 4
  B(w^1) = 3 + 1(i)  + 0 + 0 = 3 + i
  B(w^2) = 3 + 1(-1) + 0 + 0 = 2
  B(w^3) = 3 + 1(-i) + 0 + 0 = 3 - i

  A^ = [3,  1+2i,  -1,  1-2i]
  B^ = [4,  3+i,    2,  3-i ]

STEP 2: POINTWISE MULTIPLY
  C^[0] = 3 * 4          = 12
  C^[1] = (1+2i)(3+i)    = 3+i+6i+2i^2 = 3+7i-2 = 1+7i
  C^[2] = (-1) * 2       = -2
  C^[3] = (1-2i)(3-i)    = 3-i-6i+2i^2 = 3-7i-2 = 1-7i

  C^ = [12,  1+7i,  -2,  1-7i]

STEP 3: INTERPOLATE (inverse FFT = evaluate at w^{-1}, divide by n)
  Use IFFT formula: c[k] = (1/n) * sum_{j=0}^{n-1} C^[j] * w^{-jk}

  c[0] = (12 + (1+7i) + (-2) + (1-7i)) / 4 = 12/4 = 3
  c[1] = (12 + (1+7i)(w^{-1}) + (-2)(w^{-2}) + (1-7i)(w^{-3})) / 4
       = (12 + (1+7i)(-i) + (-2)(-1) + (1-7i)(i)) / 4
       = (12 + (-i+7) + 2 + (i+7)) / 4 = 28/4 = 7
  c[2] = (12 + (1+7i)(-1) + (-2)(1) + (1-7i)(-1)) / 4
       = (12 - 1 - 7i - 2 - 1 + 7i) / 4 = 8/4 = 2
  c[3] = 0  (as expected -- no x^3 term)

RESULT:  C = [3, 7, 2]  -->  3 + 7x + 2x^2   OK

Example 1: Big Number Multiplication

Problem: Multiply two large numbers represented as digit strings.

Approach: Treat each number as a polynomial where coefficient i is the i-th digit (from least significant). Multiply via FFT. Propagate carries.

cpp
string big_multiply(string& s1, string& s2) {
    int n1 = s1.size(), n2 = s2.size();
    vector<int> a(n1), b(n2);
    // Reverse so index 0 = least significant digit
    for (int i = 0; i < n1; i++) a[i] = s1[n1 - 1 - i] - '0';
    for (int i = 0; i < n2; i++) b[i] = s2[n2 - 1 - i] - '0';

    auto c = multiply(a, b);

    // Propagate carries
    string res;
    long long carry = 0;
    for (int i = 0; i < (int)c.size() || carry; i++) {
        long long cur = carry + (i < (int)c.size() ? c[i] : 0);
        res += (char)('0' + cur % 10);
        carry = cur / 10;
    }
    // Remove leading zeros
    while (res.size() > 1 && res.back() == '0') res.pop_back();
    reverse(res.begin(), res.end());
    return res;
}

Example 2: Counting Subset Sums

Problem: Given n items with values v1,,vn, count the number of subsets whose values sum to each possible total S (0SV), where V=vi.

Approach: The generating function for item i is (1+xvi). The answer is the product of all these polynomials. Use divide-and-conquer polynomial multiplication to avoid O(n) sequential multiplications.

cpp
// D&C polynomial multiplication of (1 + x^{v[0]}) * ... * (1 + x^{v[n-1]})
vector<long long> dc_multiply(vector<vector<long long>>& polys, int l, int r) {
    if (l == r) return polys[l];
    int mid = (l + r) / 2;
    auto left = dc_multiply(polys, l, mid);
    auto right = dc_multiply(polys, mid + 1, r);
    return multiply_ntt(left, right);
}

vector<long long> count_subset_sums(vector<int>& values) {
    vector<vector<long long>> polys;
    for (int v : values) {
        vector<long long> p(v + 1, 0);
        p[0] = 1;
        p[v] = 1;
        polys.push_back(p);
    }
    return dc_multiply(polys, 0, (int)polys.size() - 1);
}

Example 3: Codeforces-Style Problem

Problem (CF 954I style): Given a binary text T and pattern P (with wildcards ?), find all positions in T where P matches.

Approach: Define t[i], p[j] as the character values (0/1, with wildcard =0). A match at position s exists iff: j(t[s+j]p[j])2p[j]t[s+j]=0 (wildcards zeroed).

Reverse p and this becomes a convolution.

cpp
vector<int> wildcard_match(string& text, string& pat) {
    int n = text.size(), m = pat.size();

    // Encode: '1' -> 1, '0' -> -1 (or keep as 0/1), '?' -> 0
    // For simplicity, use the direct squared-difference approach
    // with 3 convolutions.

    // t[i], p[j]: character value (0 for wildcard)
    // Match at s: sum_j p[j]^3 * t[s+j] - 2*p[j]^2*t[s+j]^2 + p[j]*t[s+j]^3
    // = sum_j t[s+j]*p[j]*(t[s+j]-p[j])^2

    // Reverse p for convolution
    // ... (3 convolutions, check if result[s] == 0)

    // Placeholder for full implementation:
    vector<int> match_positions;
    // (see pattern 4 above for the full technique)
    return match_positions;
}

Self-Test

Without looking at the reference material, you should be able to:

  1. Write the iterative FFT from memory -- including bit-reversal permutation, butterfly loop, and the invert flag for inverse FFT.

  2. Write the multiply wrapper -- computing the required FFT size, padding both arrays, forward FFT, pointwise multiply, inverse FFT, and rounding results.

  3. State when to use FFT vs. NTT -- and which moduli are NTT-friendly. Why is 998244353 special? Why is 109+7 problematic?

  4. Recognize a frequency-array convolution problem -- "count pairs summing to k" -> build freq array -> self-convolve.

  5. State the WHT for XOR convolution -- at least at the level of "it's a different transform, same structure."

  6. Explain the bit-reversal permutation -- why does the iterative FFT reorder inputs this way, and what happens if you skip it?

  7. Trace through a small FFT by hand -- e.g., multiply [1,1]×[1,1] through the full evaluate -> pointwise multiply -> interpolate pipeline and verify the result is [1,2,1].

  • [ ] Given arrays of sizes 5 and 7, compute the exact FFT size needed and the exact result array size. (Answer: result size = 11, FFT size = 16.)
  • [ ] Compute g(p1)/8modp for p=998244353, g=3 to find the primitive 8th root of unity used in an NTT of size 8.
  • [ ] Manually verify that llround gives correct results for multiplying [105,105]×[105,105] via FFT on doubles. What is the largest coefficient, and is it below 1015?
  • [ ] Write a modular-arithmetic multiply wrapper for NTT that handles: (a) computing FFT size, (b) padding both arrays, (c) forward NTT, (d) pointwise multiply mod p, (e) inverse NTT with n1 scaling, (f) trimming the result to |A|+|B|1.
  • [ ] Given a problem "count the number of ways to pick two elements from an array with sum k," write the full solution using frequency-array self-convolution, correctly handling the i=j and ordered-vs-unordered pair adjustments.
  • [ ] Given |A|=500000 and |B|=300000, compute the required FFT size (220=1048576799999) and verify it fits within 998244353's max NTT size (223).
  • [ ] Compute the primitive 8-th root of unity modulo 998244353: g(p1)/8=3(9982443531)/8mod998244353. Verify it has order exactly 8.
  • [ ] For arrays with values up to 106 and length 105, compute the max coefficient (nv2=1017) and decide: FFT-doubles or NTT? (Answer: NTT -- exceeds 1015 precision limit.)
  • [ ] Write an NTT-based multiply function that works mod 998244353, test it on [1,1]×[1,1,1] -> [1,2,2,1].
  • [ ] Given a problem "count unordered pairs (i,j) with ai+aj=k", write the frequency-array convolution and handle the i=j double-counting correction.
  • [ ] Multiply [1,2,3]×[4,5] by hand through the full FFT pipeline: pad to size 4, compute 4th roots of unity, evaluate both polynomials, pointwise multiply, inverse FFT, and verify [4,13,22,15].
  • [ ] Given a frequency array freq = [0, 3, 0, 2, 0, 1] (three 1s, two 3s, one 5), compute the self-convolution entry for sum k=4 and explain the ordered-vs-unordered pair adjustment.
  • [ ] Write the bit-reversal permutation for n=16 -- list all 16 indices and their bit-reversed counterparts in a table.
  • [ ] Explain why padding to 218 when 217 suffices roughly doubles the FFT runtime -- relate this to the O(nlogn) formula and compute the actual operation counts.
  • [ ] For an NTT mod 998244353, compute the primitive 4th root of unity: 3(9982443531)/4mod998244353. Verify it has order exactly 4 by checking that its square equals 1(modp).

Practice Problems

#ProblemKey TechniqueDifficulty
1SPOJ POLYMULDirect polynomial multiplicationCF ~1600
2CF 632E - Thief in a ShopGenerating functions + poly powerCF 2000
3CF 993E - Nikita and Order StatisticsCounting via convolutionCF 1900
4CF 954I - Yet Another String MatchingString matching via FFTCF 2200
5CF 958F3 - Lightsabers (hard)NTT + generating functionsCF 2100
6CF 623E - Transforming SequenceNTT + binary lifting on polysCF 2600
7Atcoder ATC001 C - Fast Fourier TransformBasic FFT practiceCF ~1700
8CF 528D - Fuzzy SearchFFT for approximate matchingCF 2000
9CF 1096G - Lucky TicketsNTT on frequency arraysCF 2000
10CF 300D - Painting SquaresFFT + combinatoricsCF 2100
11Apples and BananasConvolution of two frequency arraysCSES 2000
12One Bit PositionsBitstring autocorrelation via FFTCSES 2000
13Signal ProcessingCross-correlation of two signalsCSES 2000

Suggested progression:

  1. Start with SPOJ POLYMUL and Atcoder ATC001 C to verify your template.
  2. Move to CF 993E and 1096G for standard NTT applications.
  3. Try CF 632E for polynomial exponentiation.
  4. Tackle CF 528D and 954I for convolution-based string matching.
  5. Attempt CF 623E for a challenging NTT + D&C combination.

Summary

FFT / NTT Cheat Sheet
=====================

WHEN:   Polynomial multiplication, convolution, "count sums", n >= 10^5
TIME:   O(n log n) per multiply
SPACE:  O(n)

FFT (doubles):
  + Works with any modulus (or no modulus)
  - Precision issues for large coefficients (> ~10^14)
  - Slower constant factor

NTT (mod p):
  + Exact, fast, no precision issues
  - Only works modulo NTT-friendly primes
  - 998244353 is the standard contest prime

RECIPE:
  1. Pad both arrays to next power of 2 >= |a| + |b| - 1
  2. Forward FFT/NTT on both
  3. Pointwise multiply
  4. Inverse FFT/NTT
  5. (FFT) Round to nearest integer / (NTT) done

BITWISE VARIANTS:
  XOR convolution -> Walsh-Hadamard Transform
  OR  convolution -> Sum-over-subsets (zeta) transform
  AND convolution -> Superset-sum transform
  All O(n log n), same butterfly structure.

ARBITRARY MODULUS:
  Split coefficients into 15-bit halves, use 3-4 FFTs on doubles.

The Aha Moment

Multiplying two polynomials of degree n takes O(n2) in coefficient form -- every term of one must multiply every term of the other. But in point-value form, polynomial multiplication is just element-wise multiplication: O(n). FFT converts between these two representations in O(nlogn): evaluate at n special points (roots of unity), multiply pointwise, then interpolate back. The trick is choosing evaluation points that make the DFT matrix decompose recursively -- that's where the roots of unity come in.

NTT is the same idea, but over Z/pZ instead of C -- using primitive roots modulo a prime (like 998244353=119223+1) as "roots of unity." No floating-point errors, exact modular arithmetic.


Pattern Fingerprint

Constraint / SignalTechnique
"Multiply two polynomials" or "convolve two sequences"FFT / NTT
n106 and the naive O(n2) convolution TLEsFFT / NTT gives O(nlogn)
"Count pairs (i,j) with ai+aj=k for all k"Convolution of frequency arrays
"Multiply large numbers" (1050000 digits)Treat digits as polynomial coefficients, FFT-multiply
Answer must be modulo 998244353NTT -- this prime has 223 as a factor of p1
Answer modulo arbitrary prime (e.g., 109+7)Split coefficients + multiple NTTs, or CRT with NTT-friendly primes
"Partition function" / "subset sum count" / generating functionsPolynomial operations via FFT
Divide-and-conquer optimization with convolutionOnline FFT / divide-and-conquer FFT

Rating Progression

CF RatingWhat You Should Be Able To Do
1200Know that polynomial multiplication can be done faster than O(n2); not expected to implement
1500Recognize problems reducible to convolution (frequency array multiplication); use library FFT
1800Implement FFT/NTT from scratch; handle size padding to powers of 2; know when to use NTT vs FFT
2100Polynomial division, modular composition, online convolution; combine FFT with generating functions; handle arbitrary modulus via CRT or splitting

Before You Code Checklist

  • [ ] Pad both polynomials to the same size = next power of 2 n1+n21 -- the result polynomial has degree n1+n22, so you need that many points
  • [ ] Choose FFT (doubles) or NTT (modular) -- FFT is simpler but has precision issues for n>105 with large coefficients; NTT is exact but requires a suitable prime
  • [ ] For NTT: verify the prime supports the needed transform size -- 998244353 supports up to 223; for larger, use a different prime or CRT
  • [ ] Bit-reverse the array before the butterfly -- or use the iterative FFT with bit-reversal permutation
  • [ ] After inverse FFT: divide by n -- the inverse transform includes a 1/n factor; forgetting it gives answers scaled by n

What Would You Do If...?

Scenario 1: "The problem needs convolution modulo 109+7, not 998244353."109+71=2×5×108 -- not enough factors of 2 for NTT at large sizes. Options: (1) Use FFT with doubles if n105 and coefficients 109 (careful with precision). (2) Split coefficients into halves (each MOD), do 3-4 FFTs on doubles. (3) Use CRT: run NTT with 2-3 NTT-friendly primes, then combine results with the Chinese Remainder Theorem.

Scenario 2: "You need to multiply polynomials of degree 106 with coefficients up to 109, and the answer must be exact." Intermediate values in convolution can be as large as nC21061018=1024, which far exceeds double precision (2539×1015). You MUST use NTT, or FFT with coefficient splitting. Using plain double FFT will give wrong answers.

Scenario 3: "The problem reduces to multiplying many polynomials P1P2Pn." Don't multiply left to right -- that's O(n2logn) in the worst case. Use a priority queue (merge the two smallest polynomials first, like Huffman coding) for O(nlog2n) total. Alternatively, divide-and-conquer: multiply P1Pn/2 and Pn/2+1Pn recursively.


The Mistake That Teaches

Story: You implement NTT for the first time. You use mod 998244353, primitive root g=3. Sample tests pass. You submit a problem requiring the convolution of two arrays of size 5×105 -- Wrong Answer.

You double-check your NTT implementation: butterfly structure looks right, forward and inverse transforms are consistent. You test: multiply [1,1] by [1,1] -- get [1,2,1] OK. Multiply two random arrays of size 100 -- matches brute force OK.

Then you notice: your padded size n is 220=1048576, and the inverse NTT divides by n using inv_n = power(n, MOD-2). You print inv_n... it looks right. But wait -- you're computing power(n, MOD-2, MOD) where n is an int, and 220 fits in an int. That's fine. Except... your power function takes long long arguments but inside you wrote base *= base without % MOD. For the forward NTT, the root of unity w = power(g, (MOD-1)/n) -- you check: (MOD-1)/n for n = 1048576 gives (998244352) / 1048576 = 952. So w = power(3, 952, MOD). You print w... it's correct.

After two hours, you find it: in the butterfly loop, you wrote t = a[j + half] * w % MOD -- but a[j + half] * w overflows long long because both are up to 109, and 109×109=1018, which DOES fit in long long (<9.2×1018). So that's not the issue either.

Finally, you spot it: you forgot to take % MOD after a[j] + t. The sum of two values each up to 109 is 2×109, which overflows... no, it fits in long long. But it DOESN'T stay in [0,MOD), so subsequent multiplications CAN overflow: (2×109)2>263.

The fix: Always % MOD after BOTH addition AND subtraction in the butterfly:

cpp
long long t = a[j + half] % MOD * w % MOD;
a[j + half] = (a[j] - t + MOD) % MOD;
a[j] = (a[j] + t) % MOD;

Lesson: In modular NTT, take % MOD after every single arithmetic operation. Values that "fit in long long" today will overflow when multiplied tomorrow.


Debug This

Bug 1:

Find the bug in this FFT implementation
cpp
void fft(vector<complex<double>>& a, bool invert) {
    int n = a.size();
    // bit-reversal permutation
    for (int i = 1, j = 0; i < n; i++) {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) swap(a[i], a[j]);
    }
    // butterfly
    for (int len = 2; len <= n; len <<= 1) {
        double ang = 2 * M_PI / len * (invert ? -1 : 1);
        complex<double> wlen(cos(ang), sin(ang));
        for (int i = 0; i < n; i += len) {
            complex<double> w(1);
            for (int j = 0; j < len; j++) {  // BUG HERE
                complex<double> u = a[i+j], v = a[i+j+len/2] * w;
                a[i+j] = u + v;
                a[i+j+len/2] = u - v;
                w *= wlen;
            }
        }
    }
    if (invert) for (auto& x : a) x /= n;
}

Bug: The inner loop runs j from 0 to len - 1, but it should only go to len/2 - 1. When j >= len/2, the index i + j + len/2 goes out of the current block (accesses i + len and beyond). This reads/writes wrong elements.

Fix: Change j < len to j < len / 2.

Bug 2:

Find the bug in this NTT-based polynomial multiplication
cpp
vector<long long> multiply(vector<long long>& a, vector<long long>& b) {
    int result_size = a.size() + b.size() - 1;
    int n = 1;
    while (n < result_size) n <<= 1;
    a.resize(n); b.resize(n);

    ntt(a, false);  // forward NTT
    ntt(b, false);
    for (int i = 0; i < n; i++)
        a[i] = a[i] * b[i] % MOD;
    ntt(a, true);   // inverse NTT

    a.resize(result_size);
    return a;
}

Bug: The function modifies a and b in place (resizing them with zero-padding). If the caller reuses a or b after this call, they'll find their original data destroyed and padded with zeros. Also, a[i] * b[i] may overflow if both are close to MOD (109).

Fix: Work on copies:

cpp
vector<long long> fa(a.begin(), a.end()), fb(b.begin(), b.end());
fa.resize(n); fb.resize(n);

And for the multiplication: fa[i] = fa[i] % MOD * fb[i] % MOD;

Bug 3:

Find the bug in this convolution for counting pair sums
cpp
// Count number of pairs (i,j) with a[i] + a[j] = k, for each k
// a[i] in [0, 100000]
vector<long long> freq(100001, 0);
for (int x : a) freq[x]++;

// Convolve freq with itself
auto result = multiply(freq, freq);

// result[k] = number of ordered pairs with a[i] + a[j] = k
// But we want unordered pairs, and exclude i == j
for (int k = 0; k < result.size(); k++) {
    result[k] -= freq[k/2];       // subtract pairs where i == j ???
    result[k] /= 2;               // unordered pairs
}

Bug: The "subtract diagonal" logic is wrong. A pair (i,i) contributes to result[2 * a[i]], not result[a[i]]. To remove self-pairs, subtract freq[x] from result[2*x] for each x. The current code subtracts freq[k/2] from result[k] for ALL k, which is wrong for odd k (there are no self-pairs summing to odd values when pairing identical elements) and wrong in magnitude.

Fix:

cpp
for (int x = 0; x <= 100000; x++)
    result[2 * x] -= freq[x];     // remove i == j pairs
for (int k = 0; k < result.size(); k++)
    result[k] /= 2;               // unordered

Common Mistakes

  • Not padding to next power of 2. FFT requires array sizes to be powers of 2. The result array size must be at least deg(A) + deg(B) + 1, rounded up.
  • Precision loss with FFT over doubles. For large coefficients or many terms, use NTT (mod prime) or split into high/low halves.
  • Wrong NTT prime or generator. The prime must be of the form p = c * 2^k + 1 where 2^k >= n. Standard: 998244353 = 119 * 2^23 + 1, generator 3.
  • Forgetting to divide by n in inverse FFT. The IFFT step requires dividing each coefficient by n (or by n mod p for NTT).
  • Self-convolution diagonal error. When counting unordered pairs (i,j) with a[i]+a[j]=k, subtract self-pairs from result[2*x] for each x, then divide by 2.

Historical Origin

The Fast Fourier Transform was published by James Cooley and John Tukey in 1965, revolutionizing digital signal processing. However, historians later discovered that Carl Friedrich Gauss had described essentially the same algorithm around 1805 for interpolating asteroid orbits -- predating Cooley-Tukey by 160 years. The Number Theoretic Transform (NTT) was developed in the 1970s as the modular-arithmetic analog, avoiding floating-point entirely.


Mnemonic Anchor

"Evaluate, multiply pointwise, interpolate -- O(nlogn) beats O(n2) by splitting at roots of unity."


ConceptLink
Math Fundamentals01-math-fundamentals
Number Theory02-number-theory
Divide and Conquer14-divide-and-conquer
PracticeLadder - Mixed

Built for the climb from Codeforces 1100 to 2100.