KEY RESULT · H100 · A100 · B200 · W7900 · 14 models, 7 families · fused INT4 · Triton

Memory-Traffic Saturation in
Autoregressive Transformer Decode

Decode is memory-traffic-limited — but the dominant traffic source depends on operating regime. Across 14 open-weight models and four GPU architectures, a batch-driven Hill saturation model (R² = 0.80) captures decode speedup from fused INT4 KV quantization with minimal dependence on model family, GQA ratio, or head dimension. Kernel fusion, not quantization alone, is the mechanism. A 2-minute runtime calibration test identifies models requiring asymmetric key/value precision, revealing a family-specific KV precision asymmetry absent from architectural predictors.

Links: paper-memory-decode repo · fused KV doc · DualPath (arXiv:2602.21548) · MoBA (arXiv:2502.13189)
PREAMBLE — SET THE STAGE FIRST

Three visualizations that build the foundation

Before reading the paper findings, these three interactive explainers establish the structural facts the paper measures empirically. Each one is a prerequisite for understanding why the results below matter.

Part 1 — Structural
Autoregressive Decode Bottleneck
Per-step KV reread pressure in dense causal attention. Total reads = G·C₀ + G(G−1)/2. Shows why the loop is the bottleneck before any hardware is involved.
Part 2 — Empirical
KV-Memory Traffic Governs Decode
Decode scaling across GPU classes. Hill-type saturation law, pipeline comparisons, bandwidth plateau behavior. The empirical counterpart to Part 1's structural argument.
Part 3 — Statistical
Spearman ρ — Rank Correlation
The paper uses Spearman rank correlation to test whether architectural features predict KV quantization sensitivity. They do not: ρ < 0.2, p > 0.3 across 14 models. This explainer covers what that statistic does and does not mean.
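Part 1's reread count follows from summing the per-step attention reads. A minimal sketch (the function name is mine; the closed form is the one quoted above):

```python
def total_kv_reads(c0: int, g: int) -> int:
    """Total KV-cache tokens read while decoding g tokens after a c0-token prompt.

    Step t (t = 0..g-1) runs attention over c0 + t cached tokens, so the sum
    telescopes to the closed form G*C0 + G*(G-1)/2 from Part 1.
    """
    return sum(c0 + t for t in range(g))

# Step-by-step sum agrees with the closed form:
assert total_kv_reads(2048, 512) == 512 * 2048 + 512 * 511 // 2
```

The quadratic G(G−1)/2 term is why long generations, not just long prompts, pressure bandwidth.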
P1 Batch-Driven Saturation — One Number Governs Decode

Across 767 decode-mode configurations with B ≥ 4, a single Hill-type model fitted on batch size alone explains 80% of variance in P5/P0 speedup ratio across 14 architectures on H100. Context length, KV bytes per token, GQA ratio, and head dimension are weak predictors after accounting for batch. Architecture does not fundamentally change the workload — memory-traffic physics does.

S(B) = S_max · B^γ / (B_½^γ + B^γ)   →   S_max = 3.75 · B_½ = 5.1 · γ = 1.32 · R² = 0.80
Implication
80% of asymptotic speedup is reached at B ≥ 16. 95% at B ≥ 64. Batch size is the dominant deployment knob — not architecture, not GQA ratio, not head dimension.
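The fitted curve can be evaluated directly. A minimal sketch using the published parameters (the function name is mine):

```python
def hill_speedup(batch: float, s_max: float = 3.75, b_half: float = 5.1,
                 gamma: float = 1.32) -> float:
    """Predicted P5/P0 decode speedup from the fitted Hill saturation model."""
    bg = batch ** gamma
    return s_max * bg / (b_half ** gamma + bg)

for b in (1, 4, 16, 64, 256):
    s = hill_speedup(b)
    print(f"B={b:3d}  speedup={s:.2f}x  ({100 * s / 3.75:.0f}% of max)")
```

Running the loop reproduces the implication above: B = 16 lands at roughly 82% of S_max, B = 64 at roughly 97%.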
Wang et al. 2025 — "A Systematic Characterization of LLM Inference on GPUs" confirms decode exhibits substantially higher memory bandwidth utilization than prefill — arXiv:2512.01644
P2 Fusion Is the Mechanism — Not Quantization Itself

Six pipelines were developed. The critical comparison is P0 (FP16 baseline) vs P1 (non-fused INT4) vs P5 (fused INT4). A non-fused pipeline reads INT4, dequantizes into an intermediate FP16 buffer, writes that buffer back, then calls attention. The extra write largely negates the savings from reading smaller values. The fused kernel dequantizes inside the attention loop — unpacked values live only in registers, never touching global memory. This is analogous to FlashAttention's tile-and-discard strategy.

P0 — baseline FP16 KV cache · PyTorch SDPA · no quantization 1.0×
P1 — non-fused INT4 KV → dequant → FP16 buffer → SDPA. The buffer write negates savings. 0.5× (slowdown)
P3 — fused · hardware-aware INT4 Triton kernel · RDNA3 wavefront tiling (BLOCK_N=128) 2.7–7.2×
P5 — unified · production kernel · register tiling · warp-level reductions · unified decode+prefill paths 2.7–4.8× (H100) · 1.6–7.2× (W7900)
Critical Systems Insight
Quantization without fusion is counterproductive. Any deployment that dequantizes KV cache into a temporary buffer before calling attention will see no benefit and likely a regression — validated on both NVIDIA Hopper and AMD RDNA 3. The speedup comes entirely from eliminating intermediate memory traffic.
Why speedup < 4×. INT4 compresses KV reads by 4×, but not all decode traffic is KV. Let f be the fraction of total traffic from KV reads:
Speedup ≈ 1 / (f/4 + (1−f)) = 1 / (1 − 3f/4)    At f ≈ 0.96 → 3.6×; the fitted S_max ≈ 3.75 corresponds to f ≈ 0.98
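The bound is a two-term Amdahl argument and can be checked numerically. A minimal sketch (the f values swept are illustrative):

```python
def compression_speedup(f: float, ratio: float = 4.0) -> float:
    """Speedup when only the KV fraction f of read traffic is compressed by `ratio`.

    The remaining (1 - f) of traffic (weights, activations) is untouched,
    so speedup saturates at `ratio` only as f -> 1.
    """
    return 1.0 / (f / ratio + (1.0 - f))

for f in (0.50, 0.85, 0.96, 0.98):
    print(f"f={f:.2f}  speedup={compression_speedup(f):.2f}x")
```

The sweep makes the saturation visible: even at f = 0.85 the model caps out well below the nominal 4× compression ratio.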
Full four-pipeline validation on AMD W7900 (RDNA 3) via Triton ROCm backend — fused kernel achieves 1.6–7.2× with full numerical fidelity (max absolute error < 4.9×10⁻⁴). P1 slowdown confirmed cross-vendor.
P3 Traffic Regime: Weight-Bound → KV-Bound Transition

The widely cited claim that decode is "KV-bandwidth bound" is incomplete at 7B scale. Direct bandwidth decomposition of the verification step on A100 shows that at 7B with 2K context, model weight reads constitute 94–99% of per-step bandwidth while KV reads are under 2%. The KV-bound regime emerges only at longer contexts, larger batch sizes, or larger models.

Weight-bound regime — model weights dominate (94–99% of per-step reads). Small model, short context. INT4 KV → modest benefit.
KV-bound regime — KV cache grows to 50%+ of per-step reads. Long context · large batch · large model. Fused INT4 → full speedup.
The weight-bound → KV-bound transition is driven by increasing batch size, context length, and model scale.
GQA crossover point
Qwen2.5-7B uses only 4 KV heads vs 8 for Mistral/Llama, halving its KV traffic per step. At 8K context, Qwen's KV fraction (2.99%) is less than half of Mistral's (6.90%). The crossover to KV-dominated bandwidth occurs at shorter contexts for models with more KV heads.
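The crossover arithmetic can be sketched from the published model configs. The function and the B=1/FP16 read-accounting below are mine; the configs (Qwen2.5-7B: 28 layers, 4 KV heads, 7.61B params; Mistral-7B: 32 layers, 8 KV heads, 7.24B params; head_dim 128 for both) are the standard published values:

```python
def kv_fraction(n_layers: int, n_kv_heads: int, head_dim: int,
                n_params: float, ctx: int, batch: int = 1,
                dtype_bytes: int = 2) -> float:
    """Fraction of per-step decode read traffic contributed by KV-cache reads."""
    # K and V caches are both read in full at every decode step.
    kv = batch * ctx * 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    weights = n_params * dtype_bytes  # weights are read once per step at any batch
    return kv / (kv + weights)

qwen = kv_fraction(28, 4, 128, 7.61e9, ctx=8192)     # Qwen2.5-7B: 4 KV heads
mistral = kv_fraction(32, 8, 128, 7.24e9, ctx=8192)  # Mistral-7B: 8 KV heads
print(f"Qwen {qwen:.2%}   Mistral {mistral:.2%}")
```

This simple accounting reproduces the quoted 8K-context fractions (≈2.99% vs ≈6.90%) to two decimal places.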
Wu et al. 2026 — "DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference" — external KV-cache storage I/O in agentic disaggregated serving; ~95–98% token reuse rates — arXiv:2602.21548
P4 KV Precision Asymmetry — Keys and Values Are Not the Same

Direct KV activation quantization experiments reveal a striking asymmetry: values consistently tolerate INT4 across all tested models, while keys exhibit model-dependent precision floors. On Qwen2.5-7B, compressing keys to INT4 (with INT8 values) causes catastrophic collapse — a 17,681% PPL increase and a 35,000:1 sensitivity asymmetry. On Mistral-7B, the same configuration causes only +1.23% PPL with 0.98 token agreement. This is a phase transition, not gradual degradation: the Qwen precision cliff sits at the INT7↔INT6 boundary.

SENSITIVE — Qwen2.5-7B (per-config table: K/V bits · ΔPPL · severity · token agreement)

TOLERANT — Mistral-7B (same columns)
Why architectural predictors fail
No architectural feature predicts key quantization sensitivity: GQA ratio (ρ=−0.18), RoPE θ (ρ=0.01), attention entropy (r=0.29 at 6 models, down from r=0.89 at 3). Fisher information and covariance-based signals show no significant correlation (Spearman ρ < 0.2, p > 0.3). Sensitivity is a family-specific learned characteristic, not an architectural property. See Spearman ρ explainer.
2-minute runtime calibration test
Run 5 calibration prompts through INT8 and INT6 key configs (with INT4 values). Compute mean logit error ratio INT6/INT8. Threshold τ = 3.0 achieves 100% accuracy on 13/13 evaluable models. Qwen family: ratio 5.07–5.40 (flag as sensitive). All others: ratio < 2.2 (safe for INT4–INT6 keys). Generalizes to 70B+ scale with zero false positives.
Scale attenuation. Qwen key sensitivity attenuates with model scale: catastrophic at 7B (>10% PPL), +49% at 32B, +1.55% at 72B — below practical threshold. Larger models distribute quantization error across more layers. K=FP16 / V=INT4 remains safe at all Qwen scales (+0.31% at 72B).
Consistent with KIVI (Liu et al. 2024) which observes channel-wise outlier structure differences between keys and values. Our finding extends from distribution asymmetry to minimum viable bit-width asymmetry. KIVI: arXiv:2402.02750
P5 Cross-GPU Validation — Four Matched Lanes

The same qualitative pattern holds across W7900, A100, H100, and B200. Throughput ordering follows hardware memory-system strength. Latency remains approximately linear in context length. Cross-platform variation is explained better by sustained decode-bandwidth plateau than by headline peak HBM bandwidth. B200 has 1.96× the H100 peak bandwidth but decode throughput gain is smaller — practical kernel bandwidth at decode batch sizes remains well below peak on both architectures.

| GPU | Memory | Peak BW | Tok/s (B=1, T=4096) | Tok/s (B=8, T=4096) | Batch response | Max context (B=1) |
|---|---|---|---|---|---|---|
| AMD W7900 | 48 GB | 864 GB/s | 1.2k | 1.3k | near-linear | 32K |
| NVIDIA A100 | 80 GB | 2039 GB/s | 12.7k | 15.6k | early saturation | 32K |
| NVIDIA H100 | 80 GB | 3350 GB/s | 20.1k | 37.1k | saturating | 32K |
| NVIDIA B200 | 178 GB | 6550 GB/s | 24.7k | 36.5k | saturating | 384K |
B200 extreme context
178 GB HBM enables 384K-token contexts at B=1 (149 GB peak memory, 20.5 ms decode latency) — a 12× improvement over the 32K practical limit on H100 and W7900. HBM capacity is the primary constraint for long-context inference. INT4 KV compression could extend this beyond 512K tokens on the same hardware.
P6 Speculation × Quantization — Sub- and Super-Multiplicative Composition

Speculative decoding does not relax bandwidth constraints. The verification step remains a full attention pass over the entire KV cache. Composition with KV quantization is: sub-multiplicative (ρ ≈ 0.62) for aggressively grouped models (Qwen, 4 KV heads) where KV is a small fraction of total bandwidth, and super-multiplicative (ρ up to 1.95) at long context for models with more KV heads (Llama, 8 KV heads) where KV becomes the bottleneck.

ρ = S_combined / (S_quant × S_spec)    ρ < 1.0 sub-multiplicative · ρ = 1.0 multiplicative · ρ > 1.0 super-multiplicative
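The ratio is computed directly from three measured speedups. A minimal sketch (the individual speedup values below are hypothetical, chosen only to land on the two ρ endpoints reported in the text):

```python
def composition_rho(s_combined: float, s_quant: float, s_spec: float) -> float:
    """Composition ratio: <1 sub-multiplicative, =1 multiplicative, >1 super-multiplicative."""
    return s_combined / (s_quant * s_spec)

# Hypothetical measurements illustrating both reported regimes:
print(composition_rho(3.10, 2.5, 2.0))  # 0.62 -> sub-multiplicative (Qwen-like, 4 KV heads)
print(composition_rho(9.75, 2.5, 2.0))  # 1.95 -> super-multiplicative (Llama-like, long context)
```

ρ < 1 means the two optimizations partly target the same bandwidth; ρ > 1 means quantization relieves a bottleneck that was also throttling verification.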
Batch kills speculation faster than context
At B=1, acceptance ≥ 0.81 through 8K context. At B=16, acceptance is already below 0.65 at 2K context for two of three models. Production serving (typically B ≥ 8) operates precisely in the regime where n-gram speculation provides diminishing returns.
Tree verification paradox
Tree speculation helps at low acceptance (+21% at α=0.3) but hurts at high acceptance (−17% at α=0.9). Use linear speculation at B ≤ 4. Consider tree only at B ≥ 16 with context > 8K where acceptance < 0.5.
P7 Memory Tiering — No Current Interconnect Is Close

Under synchronous dense attention, KV cache must reside in HBM for real-time decode. At T=128K, serving overflowed KV from a secondary tier at 10 ms decode latency requires 10–50 TB/s bandwidth depending on model size. All current interconnects fall orders of magnitude short.

| Interconnect | Peak BW | Shortfall at 128K (7B) | Shortfall at 128K (24B) |
|---|---|---|---|
| HBM3 (H100) | 3,350 GB/s | in-HBM target | in-HBM target |
| NVLink | 900 GB/s | ~11× insufficient | ~55× insufficient |
| PCIe 5 ×16 | 64 GB/s | ~156× insufficient | ~781× insufficient |
| CXL 2.0 | 32 GB/s | ~312× insufficient | ~1,562× insufficient |
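The shortfall column is simply required over available bandwidth. A minimal check (the helper is mine; the 10 and 50 TB/s requirements are the projection quoted above for 7B and 24B at T=128K):

```python
def shortfall(required_tb_s: float, peak_gb_s: float) -> float:
    """How many times a required bandwidth exceeds an interconnect's peak."""
    return required_tb_s * 1000.0 / peak_gb_s

print(f"NVLink, 7B @ 128K:   {shortfall(10, 900):.0f}x short")
print(f"PCIe 5, 24B @ 128K:  {shortfall(50, 64):.0f}x short")
```

Reproducing two table entries (~11× and ~781×) confirms the column is a straight ratio against peak, with no pipelining credit.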
Caveat — advanced inference techniques
These projections assume dense attention. Speculative decoding reduces full KV reads per generated token but does not reduce the instantaneous bandwidth of the verification kernel. At realistic draft lengths (k = 2–4) and acceptance rates, required bandwidth at T=128K is still 2.5–25 TB/s, exceeding NVLink by 3–28×. Sparse attention (MoBA, sliding window) could reduce per-step traffic — quantifying that interaction remains future work.

Systems like DualPath, LMCache, and CXL-SpecKV convert the tiering constraint from instantaneous bandwidth to a scheduling problem via pipelined transfers.
P8 Future Directions — Reduce KV Access, Not Just Compress It

The paper's closing argument: compression and kernel engineering are the current best path, but future architectures should address the KV bandwidth bottleneck more directly — constraining KV access proportional to available memory bandwidth rather than compressing the entire cache. Two concrete external directions that motivate this:

MoBA — arXiv:2502.13189
Mixture of Block Attention for Long-Context LLMs
Applies MoE routing to the attention mechanism. Each query attends to a small subset of KV blocks, reducing complexity toward near-linear. Allows seamless transition between full and sparse attention. Deployed in Kimi's long-context serving. Directly reduces per-step KV reads — the architectural-level complement to kernel-level quantization.

Connection: MoBA structurally reduces per-step KV reads, shrinking the KV traffic fraction f itself. Under the paper's speedup model the two approaches compose: MoBA cuts how many KV blocks are read, while fused quantization compresses the reads that remain — attacking the same f/4 traffic term from both sides.
DualPath — arXiv:2602.21548
Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
Addresses external KV-cache storage I/O in agentic disaggregated serving. ~95–98% token reuse rates in multi-turn workloads means warm caches, but per-step KV reads inside the attention kernel are unchanged. DualPath provides the system-level scheduling complement to this paper's kernel-level result.

Connection: This paper establishes the instantaneous bandwidth requirement that DualPath-style systems must route around. The tiering projections (P7) are the quantitative input to that scheduling problem.
AKP — Adaptive KV Precision
Per-Tensor Precision Policies Based on Architecture-Specific Floors
The KV precision asymmetry discovery directly motivates AKP: rather than uniform quantization, select per-tensor precision based on the ratio classifier result. Combine with attention mechanisms that constrain KV access proportional to available memory bandwidth. The fused kernel's natural support for asymmetric precision (K and V loaded at independent formats) provides the implementation substrate.
Open Questions
What Remains Unknown
Qwen root cause: The key fragility is family-specific and attenuates at 72B. Does it disappear entirely at larger scale? What in the training pipeline creates the INT7 floor?

Sparse attention interaction: MoBA and sliding-window attention reduce per-step KV traffic. How does that change the tiering bandwidth requirement and the quantization–speculation composition ratio?

Multi-GPU tensor parallel: All results are single-GPU. Tensor parallelism changes the bandwidth-to-compute ratio and may shift the saturation model parameters.
Practical deployment recipe: For 12 of 14 tested models — uniform INT4 KV quantization with the fused P5 kernel. No per-model tuning. 2.7–4.8× decode speedup on H100. For the Qwen family (7B–32B) — K=FP16/V=INT4 or K=INT8/V=INT4. 37–62% bandwidth reduction preserving quality. Determine which to use with the 2-minute ratio classifier.
Code: ratio_classifier.py · triton_kernels.py