C₀ = H + A (initial cached context, already in HBM at decode start)
P1 — Autoregressive Decode Loop
Attention over full prior KV each step.
Decode step g reads C₀ + (g−1) positions: the initial cached context (C₀ = H+A)
plus the g−1 previously generated tokens. The loop is causal — each step depends on all prior outputs.
Wang et al. 2025 — A Systematic Characterization of LLM Inference on GPUs — arXiv:2512.01644
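The per-step read count is simple enough to sketch directly. A minimal Python illustration (the H and A values below are hypothetical, chosen to match the toy schematic in P2):

```python
# Positions read by attention at decode step g, given an initial
# cached context of C0 = H + A positions already in HBM.
def positions_read(c0, g):
    """Step g (1-based) attends over C0 cached positions plus g-1 prior outputs."""
    return c0 + (g - 1)

c0 = 4 + 8   # C0 = H + A, e.g. H=4 history + A=8 appended (toy values)
reads = [positions_read(c0, g) for g in range(1, 5)]
# step 1 reads 12 positions, step 2 reads 13, step 3 reads 14, step 4 reads 15
```

Every step rereads the entire prior context; only the single new token's KV is written.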
P2 — Autoregressive Reads — Toy Schematic (4 cached + 8 generated)
initial cached C₀ KV (reread every step)
prior generated KV (reread)
Q_t — token being generated (no KV read of self)
future (not yet generated)
Reads at step g (1-based): C₀ + (g − 1). C₀ = H+A is already in cache.
Total read positions over G steps: G·C₀ + G(G−1)/2 ≡ Θ(G·C₀ + G²)
H = history · A = appended · G = generated · C₀ = H+A
Schematic shows 4 cached + 8 generated tokens. Live totals use slider values.
The last generated token (Q_g) forms a key/value only after output — it is not reread in the step that creates it.
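The closed form for total reads can be checked against a direct summation over the loop. A small sketch using the schematic's toy values (4 cached + 8 generated; nothing here is model-specific):

```python
# Total positions read over G decode steps: each step g reads C0 + (g-1),
# which sums to the closed form G*C0 + G*(G-1)/2 shown in the panel.
def total_reads(c0, G):
    return G * c0 + G * (G - 1) // 2

c0, G = 4, 8   # toy schematic: 4 cached + 8 generated tokens
direct = sum(c0 + (g - 1) for g in range(1, G + 1))
assert direct == total_reads(c0, G)   # both give 60 read positions
```

The `G*C0` term is the rectangle (cached context reread every step); the `G*(G-1)/2` term is the triangle of generated-token rereads.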
P3 — HBM Bandwidth Demand vs Active Context (Attention-Side)
Required HBM bandwidth = activeContext × kvBytesPerToken × tokensPerSecond.
Active context at step g = C₀ + (g−1). Curve shown at representative step g=1 (lower bound); demand grows linearly through decode.
KV bytes/token depends on architecture, layers, head count, head dim, and dtype — use the slider to match your model.
Wang et al. 2025 — arXiv:2512.01644 · HBM specs: A100 ~2 TB/s, H100 ~3.35 TB/s, H200 ~4.8 TB/s
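The panel's bandwidth formula is a straight product. A hedged sketch with hypothetical numbers (the 160 KB/token KV figure is illustrative for a mid-size MHA model, not taken from any cited source):

```python
# Required HBM read bandwidth for attention-side KV traffic:
# active_context * kv_bytes_per_token * tokens_per_second.
def required_bw_gb_s(active_context, kv_bytes_per_token, tokens_per_second):
    return active_context * kv_bytes_per_token * tokens_per_second / 1e9

# Example: 32k active context, ~160 KB of KV per token, 50 tok/s per request.
bw = required_bw_gb_s(32_768, 160_000, 50)   # ~262 GB/s for KV reads alone
```

Compare against the HBM specs above (A100 ~2 TB/s) to see how few concurrent long-context requests it takes to saturate the memory system, before counting weight reads at all.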
▲ TRIANGULAR WEDGE OF DOOM
P4 — KV Reread Frequency — Autoregressive Read Accumulation
TOTAL REREADS PER GENERATED TOKEN POSITION
First generated token is reread G−1 times.
Last generated token is reread 0 times.
Each initial cached C₀ token is reread G times (shown as separate band above).
BANDWIDTH SHARE PER GENERATED TOKEN — THE BRUTAL PICTURE
The first generated token's KV is reread G−1 times, so it consumes (G−1)× the read bandwidth of the next-to-last token (reread once).
The last generated token contributes zero rereads — it is only ever written.
This is the graph that stops reviewers from arguing.
Step g reads C₀ + (g−1) prior positions.
Total read positions: G·C₀ + G(G−1)/2 ≡ Θ(G·C₀ + G²).
The C₀ rectangle (initial cached context) dominates until G becomes comparable to C₀; the triangle term overtakes it once G exceeds roughly 2·C₀.
Autoregressive KV reread region: rectangle for C₀ (every initial token read G times) plus strict lower triangle for generated-token rereads.
[Wang et al. 2025 — arXiv:2512.01644]
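The rectangle-plus-triangle picture can be sketched per position. Toy values only; nothing here comes from a real model config:

```python
# Reread counts per KV position over G decode steps.
# Each of the C0 initial tokens is read at every step (the rectangle band);
# generated token j is written at step j and reread at steps j+1..G
# (the strict lower triangle).
def rereads(c0, G):
    cached = [G] * c0                             # rectangle: G rereads each
    generated = [G - j for j in range(1, G + 1)]  # triangle: G-1, G-2, ..., 0
    return cached, generated

cached, gen = rereads(4, 8)   # matches the 4-cached + 8-generated schematic
# cached == [8, 8, 8, 8]; gen == [7, 6, 5, 4, 3, 2, 1, 0]
```

Summing both lists recovers the total-reads formula: 4·8 + 8·7/2 = 60 for the toy case.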
P5 — APPENDIX: Agentic / Disaggregated Serving — KV Reuse & External Bandwidth
This panel covers system-level amplifiers that are separate from the intrinsic causal attention bottleneck above.
Offloaded KV (PCIe/NVMe), agentic multi-turn reuse, and disaggregated prefill/decode (PD) are real workload effects —
but they sit on top of the structural mechanism in P1–P4, not inside it.
~95–98%
KV token reuse rate
in agentic multi-turn workloads
[Wu et al. 2026 — DualPath]
Total read positions
G·C₀ + G(G−1)/2
(live from sliders)
C₀ = H+A (live from sliders) · G = 128
Total read positions = G·C₀ + G(G−1)/2 = —
High reuse tells you the cache is warm and in HBM. It does not eliminate per-step KV reads in the attention kernel.
DualPath addresses external storage I/O bandwidth for agentic disaggregated serving — a separate bottleneck.
Wu et al. 2026 — DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference — arXiv:2602.21548
P6 — Architecture Comparison: Same Autoregressive Loop, Different KV Compression
| Architecture | Attention type | KV per token (relative) | KV compression ratio | Still autoregressive? | Bar (KV size) |
|---|---|---|---|---|---|
Math note (all ratios illustrative / model-specific):
MHA stores K,V for every head: KV_bytes = 2 × L × n_heads × d_head × C × sizeof(dtype).
GQA shares K,V across Q-head groups, reducing by n_heads / n_kv_groups.
MLA (DeepSeek-V2/V3) projects KV into a compressed latent c_KV ≪ d_model
and up-projects at attention time; DeepSeek-V2 reports ~93.3% KV-cache reduction for that model family.
The autoregressive reread happens every step regardless — architecture only changes how many bytes are reread.
The decode loop structure is identical across all architectures.
What varies is the KV payload per token.
Compression ratios shown are illustrative presets, not universal constants — verify against your model's config.
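The MHA/GQA sizing above reduces to a one-line formula per token. A sketch with illustrative layer and head counts (hypothetical, not any specific model's config; fp16/bf16 assumed):

```python
# Per-token KV payload: K and V for every layer and every KV head.
# This is the per-token slice of the panel's KV_bytes formula
# (i.e. without the context-length factor C).
def kv_bytes_per_token(n_layers, n_kv_heads, d_head, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * d_head * dtype_bytes

mha = kv_bytes_per_token(n_layers=32, n_kv_heads=32, d_head=128)  # every head stores KV
gqa = kv_bytes_per_token(n_layers=32, n_kv_heads=8,  d_head=128)  # 4 Q-heads share each KV head
# mha == 524288 bytes (~512 KB/token); gqa == 131072 bytes (~128 KB/token)
```

GQA's saving is exactly the head-sharing ratio (here 32/8 = 4×); MLA's latent projection does not fit this formula and must be read from the model's reported compression figure.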
DeepSeek-V2: arXiv:2405.04434 · Wang et al. 2025: arXiv:2512.01644
▶ SOURCES
Wang et al. 2025 — "A Systematic Characterization of LLM Inference on GPUs" — attention decode kernels are memory-bandwidth bound; bottleneck dominance is workload- and context-dependent
https://arxiv.org/pdf/2512.01644
Wu et al. 2026 — "DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference" — external KV-cache storage I/O in agentic disaggregated serving; ~95–98% token reuse rates
https://arxiv.org/pdf/2602.21548
DeepSeek-V2 2024 — MLA (Multi-head Latent Attention) compression architecture; ~93.3% KV-cache reduction for DeepSeek-V2 model family (illustrative, model-specific)
https://arxiv.org/abs/2405.04434