C₀ = H + A (initial cached context, already in HBM at decode start)
P1 — Autoregressive Decode Loop
Attention over full prior KV each step.
Decode step g reads C₀ + (g−1) positions: the initial cached context (C₀ = H+A)
plus the g−1 previously generated tokens. The loop is causal — each step depends on all prior outputs.
Wang et al. 2025 — A Systematic Characterization of LLM Inference on GPUs — arXiv:2512.01644
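The per-step read count is simple enough to sketch directly. A minimal Python illustration (the H and A values below are hypothetical, chosen to match the toy schematic in P2):

```python
# Positions read by attention at decode step g, given an initial
# cached context of C0 = H + A positions already in HBM.
def positions_read(c0, g):
    """Step g (1-based) attends over C0 cached positions plus g-1 prior outputs."""
    return c0 + (g - 1)

c0 = 4 + 8   # C0 = H + A, e.g. H=4 history + A=8 appended (toy values)
reads = [positions_read(c0, g) for g in range(1, 5)]
# step 1 reads 12 positions, step 2 reads 13, step 3 reads 14, step 4 reads 15
```

Every step rereads the entire prior context; only the single new token's KV is written.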
P2 — Autoregressive Reads — Toy Schematic (4 cached + 8 generated)
initial cached C₀ KV (reread every step)
prior generated KV (reread)
Q_t — token being generated (no KV read of self)
future (not yet generated)
Reads at step g (1-based): C₀ + (g − 1). C₀ = H+A is already in cache.
Total read positions over G steps: G·C₀ + G(G−1)/2 ≡ Θ(G·C₀ + G²)
H = history · A = appended · G = generated · C₀ = H+A
Schematic shows 4 cached + 8 generated tokens. Live totals use slider values.
The last generated token (Q_g) forms a key/value only after output — it is not reread in the step that creates it.
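The closed form for total reads can be checked against a direct summation over the loop. A small sketch using the schematic's toy values (4 cached + 8 generated; nothing here is model-specific):

```python
# Total positions read over G decode steps: each step g reads C0 + (g-1),
# which sums to the closed form G*C0 + G*(G-1)/2 shown in the panel.
def total_reads(c0, G):
    return G * c0 + G * (G - 1) // 2

c0, G = 4, 8   # toy schematic: 4 cached + 8 generated tokens
direct = sum(c0 + (g - 1) for g in range(1, G + 1))
assert direct == total_reads(c0, G)   # both give 60 read positions
```

The `G*C0` term is the rectangle (cached context reread every step); the `G*(G-1)/2` term is the triangle of generated-token rereads.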
P3 — HBM Bandwidth Demand vs Active Context (Attention-Side)
Required HBM bandwidth = activeContext × kvBytesPerToken × tokensPerSecond.
Active context at step g = C₀ + (g−1). Curve shown at representative step g=1 (lower bound); demand grows linearly through decode.
KV bytes/token depends on architecture, layers, head count, head dim, and dtype — use the slider to match your model.
Wang et al. 2025 — arXiv:2512.01644 · HBM specs: A100 ~2 TB/s, H100 ~3.35 TB/s, H200 ~4.8 TB/s
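The panel's bandwidth formula is a straight product. A hedged sketch with hypothetical numbers (the 160 KB/token KV figure is illustrative for a mid-size MHA model, not taken from any cited source):

```python
# Required HBM read bandwidth for attention-side KV traffic:
# active_context * kv_bytes_per_token * tokens_per_second.
def required_bw_gb_s(active_context, kv_bytes_per_token, tokens_per_second):
    return active_context * kv_bytes_per_token * tokens_per_second / 1e9

# Example: 32k active context, ~160 KB of KV per token, 50 tok/s per request.
bw = required_bw_gb_s(32_768, 160_000, 50)   # ~262 GB/s for KV reads alone
```

Compare against the HBM specs above (A100 ~2 TB/s) to see how few concurrent long-context requests it takes to saturate the memory system, before counting weight reads at all.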
▲ TRIANGULAR WEDGE OF DOOM
P4 — KV Reread Frequency — Autoregressive Read Accumulation
TOTAL REREADS PER GENERATED TOKEN POSITION
First generated token is reread G−1 times.
Last generated token is reread 0 times.
Each initial cached C₀ token is reread G times (shown as separate band above).
BANDWIDTH SHARE PER GENERATED TOKEN — THE BRUTAL PICTURE
The first generated token's KV is reread G−1 times, so it consumes (G−1)× the read bandwidth of the next-to-last token (reread once).
The last generated token contributes zero rereads — it is only ever written.
This is the graph that stops reviewers from arguing.
Step g reads C₀ + (g−1) prior positions.
Total read positions: G·C₀ + G(G−1)/2 ≡ Θ(G·C₀ + G²).
The C₀ rectangle (initial cached context) dominates until G becomes comparable to C₀; the triangle term overtakes it once G exceeds roughly 2·C₀.
Autoregressive KV reread region: rectangle for C₀ (every initial token read G times) plus strict lower triangle for generated-token rereads.
[Wang et al. 2025 — arXiv:2512.01644]
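The rectangle-plus-triangle picture can be sketched per position. Toy values only; nothing here comes from a real model config:

```python
# Reread counts per KV position over G decode steps.
# Each of the C0 initial tokens is read at every step (the rectangle band);
# generated token j is written at step j and reread at steps j+1..G
# (the strict lower triangle).
def rereads(c0, G):
    cached = [G] * c0                             # rectangle: G rereads each
    generated = [G - j for j in range(1, G + 1)]  # triangle: G-1, G-2, ..., 0
    return cached, generated

cached, gen = rereads(4, 8)   # matches the 4-cached + 8-generated schematic
# cached == [8, 8, 8, 8]; gen == [7, 6, 5, 4, 3, 2, 1, 0]
```

Summing both lists recovers the total-reads formula: 4·8 + 8·7/2 = 60 for the toy case.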
P5 — APPENDIX: Agentic / Disaggregated Serving — KV Reuse & External Bandwidth
This panel covers system-level amplifiers that are separate from the intrinsic causal attention bottleneck above.
Offloaded KV (PCIe/NVMe), agentic multi-turn reuse, and disaggregated prefill/decode (PD) are real workload effects —
but they sit on top of the structural mechanism in P1–P4, not inside it.
~95–98%
KV token reuse rate
in agentic multi-turn workloads
[Wu et al. 2026 — DualPath]
Total read positions
G·C₀ + G(G−1)/2
(live from sliders)
C₀ = H+A (live from sliders) · G = 128
Total read positions = G·C₀ + G(G−1)/2 = —
High reuse tells you the cache is warm and in HBM. It does not eliminate per-step KV reads in the attention kernel.
DualPath addresses external storage I/O bandwidth for agentic disaggregated serving — a separate bottleneck.
Wu et al. 2026 — DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference — arXiv:2602.21548
P6 — Architecture Comparison: Same Autoregressive Loop, Different KV Compression
| Architecture | Attention type | KV per token (relative) | KV compression ratio | Still autoregressive? | Bar (KV size) |
|---|---|---|---|---|---|
Math note (all ratios illustrative / model-specific):
MHA stores K,V for every head: KV_bytes = 2 × L × n_heads × d_head × C × sizeof(dtype).
GQA shares K,V across Q-head groups, reducing by n_heads / n_kv_groups.
MLA (DeepSeek-V2/V3) projects KV into a compressed latent c_KV ≪ d_model
and up-projects at attention time; DeepSeek-V2 reports ~93.3% KV-cache reduction for that model family.
The autoregressive reread happens every step regardless — architecture only changes how many bytes are reread.
The decode loop structure is identical across all architectures.
What varies is the KV payload per token.
Compression ratios shown are illustrative presets, not universal constants — verify against your model's config.
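The MHA/GQA sizing above reduces to a one-line formula per token. A sketch with illustrative layer and head counts (hypothetical, not any specific model's config; fp16/bf16 assumed):

```python
# Per-token KV payload: K and V for every layer and every KV head.
# This is the per-token slice of the panel's KV_bytes formula
# (i.e. without the context-length factor C).
def kv_bytes_per_token(n_layers, n_kv_heads, d_head, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * d_head * dtype_bytes

mha = kv_bytes_per_token(n_layers=32, n_kv_heads=32, d_head=128)  # every head stores KV
gqa = kv_bytes_per_token(n_layers=32, n_kv_heads=8,  d_head=128)  # 4 Q-heads share each KV head
# mha == 524288 bytes (~512 KB/token); gqa == 131072 bytes (~128 KB/token)
```

GQA's saving is exactly the head-sharing ratio (here 32/8 = 4×); MLA's latent projection does not fit this formula and must be read from the model's reported compression figure.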
DeepSeek-V2: arXiv:2405.04434 · Wang et al. 2025: arXiv:2512.01644
▶ SOURCES
Wang et al. 2025 — "A Systematic Characterization of LLM Inference on GPUs" — attention decode kernels are memory-bandwidth bound; bottleneck dominance is workload- and context-dependent
https://arxiv.org/pdf/2512.01644
Wu et al. 2026 — "DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference" — external KV-cache storage I/O in agentic disaggregated serving; ~95–98% token reuse rates
https://arxiv.org/pdf/2602.21548
DeepSeek-V2 2024 — MLA (Multi-head Latent Attention) compression architecture; ~93.3% KV-cache reduction for DeepSeek-V2 model family (illustrative, model-specific)
https://arxiv.org/abs/2405.04434