ATTENTION DECODE

Attention-Side Decode Bottleneck

Per-step KV reread pressure in dense causal attention. Whether this dominates end-to-end latency still depends on workload and serving architecture.

memory-bandwidth-bound · KV cache rereads · Total reads = G·C₀ + G(G−1)/2 · MHA · GQA · MLA

In dense causal attention, each decode step forms one new query and reads all C₀ + (g−1) prior K/V positions — the initial cached context plus every previously generated token. This figure isolates that attention-side effect only. Offloaded KV, PD disaggregation, and agentic cache reuse are separate system-level amplifiers treated in the appendix below. Hardware characterization confirms decode attention kernels are memory-bandwidth bound [Wang et al. 2025 — arXiv:2512.01644].

C₀ = H + A (initial cached context, already in HBM at decode start)
P1 — Autoregressive Decode Loop
Attention over full prior KV each step. Decode step g reads C₀ + (g−1) positions: the initial cached context C₀ = H+A plus all previously generated tokens. The loop is causal — each step depends on all prior outputs.
Wang et al. 2025 — A Systematic Characterization of LLM Inference on GPUs — arXiv:2512.01644
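The P1 read count can be checked directly: the per-step sum telescopes to the closed form. A minimal sketch (function names are illustrative):

```python
def reads_at_step(c0: int, g: int) -> int:
    """KV positions read at decode step g (1-based): initial cache C0 plus g-1 prior generated tokens."""
    return c0 + (g - 1)

def total_reads(c0: int, num_steps: int) -> int:
    """Closed form for total KV positions read over G steps: G*C0 + G*(G-1)/2."""
    return num_steps * c0 + num_steps * (num_steps - 1) // 2

# The per-step sum matches the closed form for any C0, G.
assert total_reads(4096, 256) == sum(reads_at_step(4096, g) for g in range(1, 257))
```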
P2 — Autoregressive Reads — Toy Schematic (4 cached + 8 generated)
Legend: initial cached C₀ KV (reread every step) · prior generated KV (reread) · Q_t — token being generated (no KV read of self) · future (not yet generated)
Reads at step g (1-based): C₀ + (g − 1)  ·  C₀ = H+A already in cache
Total read positions over G steps: G·C₀ + G(G−1)/2  ≡  Θ(G·C₀ + G²)
H = history  ·  A = appended  ·  G = generated  ·  C₀ = H+A
Schematic shows 4 cached + 8 generated tokens. Live totals use slider values. The last generated token (Q_g) forms a key/value only after output — it is not reread in the step that creates it.
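Running the toy schematic's numbers (C₀ = 4 cached, G = 8 generated) through the formulas above:

```python
C0, G = 4, 8  # the schematic's 4 cached + 8 generated tokens

per_step = [C0 + (g - 1) for g in range(1, G + 1)]  # KV reads at each decode step
total = G * C0 + G * (G - 1) // 2                   # closed form

assert per_step[0] == 4      # step 1 reads only the initial cache
assert per_step[-1] == 11    # step 8 reads C0 + 7 prior generated tokens
assert sum(per_step) == total == 60
```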
P3 — HBM Bandwidth Demand vs Active Context (Attention-Side)
Required HBM bandwidth = activeContext × kvBytesPerToken × tokensPerSecond. Active context at step g = C₀ + (g−1). Curve shown at representative step g=1 (lower bound); demand grows linearly through decode. KV bytes/token depends on architecture, layers, head count, head dim, and dtype — use the slider to match your model.
Wang et al. 2025 — arXiv:2512.01644 · HBM specs: A100 ~2 TB/s, H100 ~3.35 TB/s, H200 ~4.8 TB/s
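As a worked example of the P3 formula, with an assumed GQA configuration (32 layers, 8 KV heads, head dim 128, fp16 — illustrative values, not any specific model's config):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    # K and V per layer per KV head: 2 * L * n_kv * d_head * sizeof(dtype)
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def required_hbm_bw(active_context: int, kv_bytes: int, tokens_per_s: float) -> float:
    # bytes/s the decode attention kernel must stream just to reread KV
    return active_context * kv_bytes * tokens_per_s

kv = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)  # 131072 B = 128 KiB/token
bw = required_hbm_bw(active_context=8192, kv_bytes=kv, tokens_per_s=50)
print(f"{kv / 1024:.0f} KiB/token -> {bw / 1e9:.1f} GB/s per sequence")
```

Roughly 54 GB/s for a single 8K-context sequence at 50 tok/s; batched decode multiplies this, which is how aggregate demand approaches the HBM ceilings quoted above.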
▲ TRIANGULAR WEDGE OF DOOM
P4 — KV Reread Frequency — Autoregressive Read Accumulation
TOTAL REREADS PER GENERATED TOKEN POSITION
First generated token is reread G−1 times. Last generated token is reread 0 times. Each initial cached C₀ token is reread G times (shown as separate band above).
BANDWIDTH SHARE PER GENERATED TOKEN — THE BRUTAL PICTURE
The first generated token is reread G−1 times; the last generated token contributes zero rereads — it is only ever written. This is the graph that stops reviewers from arguing.
Step g reads C₀ + (g−1) prior positions.  ·  Total read positions: G·C₀ + G(G−1)/2  ≡  Θ(G·C₀ + G²). The C₀ rectangle (initial cached context) dominates unless G ≈ C₀.
Autoregressive KV reread region: rectangle for C₀ (every initial token read G times) plus strict lower triangle for generated-token rereads. [Wang et al. 2025 — arXiv:2512.01644]
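The P4 wedge is easy to tabulate: generated token j (1-based) is read by steps j+1 … G, i.e. G−j times, while each of the C₀ cached tokens is read every step. The rectangle-plus-triangle decomposition falls out directly:

```python
C0, G = 4, 8  # same toy sizes as the P2 schematic

cached_rereads = [G] * C0                        # each initial token is read on all G steps
gen_rereads = [G - j for j in range(1, G + 1)]   # token j is read by steps j+1..G

assert gen_rereads[0] == G - 1   # first generated token: reread G-1 times
assert gen_rereads[-1] == 0      # last generated token: never reread
# rectangle (C0 band) + strict lower triangle = total read positions
assert sum(cached_rereads) + sum(gen_rereads) == G * C0 + G * (G - 1) // 2
```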
P5 — APPENDIX Agentic / Disaggregated Serving: KV Reuse & External Bandwidth
This panel covers system-level amplifiers that are separate from the intrinsic causal attention bottleneck above. Offloaded KV (PCIe/NVMe), agentic multi-turn reuse, and disaggregated prefill/decode (PD) are real workload effects — but they sit on top of the structural mechanism in P1–P4, not inside it.
~95–98%
KV token reuse rate
in agentic multi-turn workloads
[Wu et al. 2026 — DualPath]
Total read positions over G steps: G·C₀ + G(G−1)/2, with C₀ = H + A (values computed live from the H, A, and G sliders).
High reuse tells you the cache is warm and in HBM. It does not eliminate per-step KV reads in the attention kernel. DualPath addresses external storage I/O bandwidth for agentic disaggregated serving — a separate bottleneck.
Wu et al. 2026 — DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference — arXiv:2602.21548
P6 — Architecture Comparison: Same Autoregressive Loop, Different KV Compression
Columns: Architecture · Attention type · KV per token (relative) · KV compression ratio · Still autoregressive? (rows populated live from the presets)
Math note (all ratios illustrative / model-specific): MHA stores K,V for every head: KV_bytes = 2 × L × n_heads × d_head × C × sizeof(dtype). GQA shares K,V across Q-head groups, reducing by n_heads / n_kv_groups. MLA (DeepSeek-V2/V3) projects KV into a compressed latent c_KV ≪ d_model and up-projects at attention time; DeepSeek-V2 reports ~93.3% KV-cache reduction for that model family. The autoregressive reread happens every step regardless — architecture only changes how many bytes are reread.
The decode loop structure is identical across all architectures. What varies is the KV payload per token. Compression ratios shown are illustrative presets, not universal constants — verify against your model's config.
DeepSeek-V2: arXiv:2405.04434 · Wang et al. 2025: arXiv:2512.01644
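The math note above can be made concrete. A sketch with illustrative dimensions (32 layers, 32 Q heads, 8 KV groups, head dim 128, and a hypothetical MLA latent width of 512 — none of these are any particular model's config, and the MLA line is simplified; real MLA also caches decoupled RoPE key dims):

```python
def mha_kv_bytes(L, n_heads, d_head, dtype_bytes=2):
    return 2 * L * n_heads * d_head * dtype_bytes        # K and V for every head

def gqa_kv_bytes(L, n_kv_groups, d_head, dtype_bytes=2):
    return 2 * L * n_kv_groups * d_head * dtype_bytes    # K/V shared across Q-head groups

def mla_kv_bytes(L, d_latent, dtype_bytes=2):
    return L * d_latent * dtype_bytes                    # one compressed latent per layer (simplified)

mha = mha_kv_bytes(32, 32, 128)   # 512 KiB/token
gqa = gqa_kv_bytes(32, 8, 128)    # 128 KiB/token: smaller by n_heads / n_kv_groups = 4x
mla = mla_kv_bytes(32, 512)       # 32 KiB/token under the assumed latent width
assert mha // gqa == 32 // 8
assert mha // mla == 16
```

Whatever the per-token payload, the reread schedule from P1–P4 is unchanged — compression scales the bytes, not the loop.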

▶ SOURCES

Wang et al. 2025 — "A Systematic Characterization of LLM Inference on GPUs" — attention decode kernels are memory-bandwidth bound; bottleneck dominance is workload- and context-dependent https://arxiv.org/pdf/2512.01644
Wu et al. 2026 — "DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference" — external KV-cache storage I/O in agentic disaggregated serving; ~95–98% token reuse rates https://arxiv.org/pdf/2602.21548
DeepSeek-V2 2024 — MLA (Multi-head Latent Attention) compression architecture; ~93.3% KV-cache reduction for DeepSeek-V2 model family (illustrative, model-specific) https://arxiv.org/abs/2405.04434