Memory-Traffic Saturation in Autoregressive Transformer Decode
Decode is memory-traffic-limited — but the dominant traffic source depends on operating regime.
Across 14 open-weight models and four GPU architectures, a batch-driven Hill saturation model
(R² = 0.80) captures decode speedup from fused INT4 KV quantization with minimal dependence
on model family, GQA ratio, or head dimension. Kernel fusion, not quantization alone, is the mechanism.
A 2-minute runtime calibration test identifies models requiring asymmetric key/value precision,
revealing a family-specific KV precision asymmetry absent from architectural predictors.
Before reading the paper findings, the interactive explainers below establish the structural facts
the paper measures empirically. Each one is a prerequisite for understanding why the results below matter.
P1 · Batch-Driven Saturation — One Number Governs Decode
Across 767 decode-mode configurations with B ≥ 4, a single Hill-type model fitted on batch size alone
explains 80% of variance in P5/P0 speedup ratio across 14 architectures on H100.
Context length, KV bytes per token, GQA ratio, and head dimension are weak predictors after accounting for batch.
Architecture does not fundamentally change the workload — memory-traffic physics does.
80% of asymptotic speedup is reached at B ≥ 16. 95% at B ≥ 64.
Batch size is the dominant deployment knob — not architecture, not GQA ratio, not head dimension.
[Interactive chart: predicted speedup and % of max vs. batch size]
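The qualitative shape of this curve can be sketched with a Hill-type function S(B) = S_max · B^n / (K^n + B^n). The parameters below are illustrative placeholders, not the paper's fitted coefficients; with these values B=16 lands at exactly 80% of the asymptote, matching the shape of the reported curve.

```python
def hill_speedup(batch, s_max, k_half, n):
    """Hill-type saturation: speedup rises with batch size and plateaus at s_max."""
    return s_max * batch**n / (k_half**n + batch**n)

# Illustrative parameters (NOT the paper's fit): 4x asymptote, half-saturation at B=4.
S_MAX, K_HALF, N = 4.0, 4.0, 1.0

for b in (1, 4, 16, 64):
    s = hill_speedup(b, S_MAX, K_HALF, N)
    print(f"B={b:3d}  speedup={s:.2f}x  ({100 * s / S_MAX:.0f}% of max)")
```

The single-knob structure is the point: batch size alone drives the curve, and everything else enters only through the (roughly constant) fitted parameters.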
Wang et al. 2025 — "A Systematic Characterization of LLM Inference on GPUs" confirms decode exhibits substantially higher memory bandwidth utilization than prefill — arXiv:2512.01644
P2 · Fusion Is the Mechanism — Not Quantization Itself
Six pipelines were developed. The critical comparison is P0 (FP16 baseline) vs P1 (non-fused INT4) vs P5 (fused INT4).
A non-fused pipeline reads INT4, dequantizes into an intermediate FP16 buffer,
writes that buffer back, then calls attention. The extra write largely negates the savings from reading smaller values.
The fused kernel dequantizes inside the attention loop — unpacked values live only in registers, never touching global memory.
This is analogous to FlashAttention's tile-and-discard strategy.
Quantization without fusion is counterproductive. Any deployment that dequantizes KV cache into a temporary
buffer before calling attention will see no benefit and likely a regression — validated on both NVIDIA Hopper and AMD RDNA 3.
The speedup comes entirely from eliminating intermediate memory traffic.
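The byte accounting behind this claim can be written out directly; the 1 GiB cache size is illustrative, and the model counts only KV-path traffic (weight reads are unchanged by either pipeline).

```python
def kv_path_traffic(kv_fp16_bytes, fused):
    """Global-memory bytes touched on the KV path per attention call.

    Non-fused: read the INT4 cache, write a dequantized FP16 scratch
    buffer, then attention reads that buffer back.
    Fused: read INT4 only; dequantized values live in registers.
    """
    int4_read = kv_fp16_bytes // 4
    if fused:
        return int4_read
    return int4_read + 2 * kv_fp16_bytes  # + scratch write + scratch read

kv = 2**30  # 1 GiB FP16-equivalent KV cache, illustrative
print(kv_path_traffic(kv, fused=False) / kv)  # 2.25: worse than the FP16 baseline (1.0)
print(kv_path_traffic(kv, fused=True) / kv)   # 0.25: the full 4x traffic reduction
```

The non-fused pipeline moves 2.25× the baseline's bytes, which is exactly why it regresses rather than merely breaking even.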
Why speedup < 4×. INT4 compresses KV reads by 4×, but not all decode traffic is KV.
Let f be the fraction of total traffic from KV reads; the best-case speedup is then 1 / ((1 − f) + f/4), which approaches 4× only as f → 1.
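A minimal Amdahl-style sketch of this bound, with only the KV fraction f of traffic compressed:

```python
def best_case_speedup(f, compression=4.0):
    """Amdahl-style bound: only the KV fraction f of traffic shrinks by `compression`."""
    return 1.0 / ((1.0 - f) + f / compression)

print(f"{best_case_speedup(0.02):.2f}x")  # weight-bound 7B, short context: ~1.02x
print(f"{best_case_speedup(0.50):.2f}x")  # KV at half of total traffic: 1.60x
print(f"{best_case_speedup(1.00):.2f}x")  # pure KV traffic: the full 4.00x
```

At the weight-bound extreme (f ≈ 2%, the measured 7B/2K regime) the bound is barely above 1×, which is why the regimes below matter.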
The widely cited claim that decode is "KV-bandwidth bound" is incomplete at 7B scale.
Direct bandwidth decomposition of the verification step on A100 shows that at 7B with 2K context,
model weight reads constitute 94–99% of per-step bandwidth while KV reads are under 2%.
The KV-bound regime emerges only at longer contexts, larger batch sizes, or larger models.
Weight-bound regime (weights are 94–99% of traffic): small model, short context; INT4 KV → modest benefit.
KV-bound regime (KV grows to 50%+ of traffic): long context, large batch, large model; fused INT4 → full speedup.
Qwen2.5-7B uses only 4 KV heads vs 8 for Mistral/Llama, halving its KV traffic per step.
At 8K context, Qwen's KV fraction (2.99%) is less than half of Mistral's (6.90%).
The crossover to KV-dominated bandwidth occurs at shorter contexts for models with more KV heads.
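The KV fractions quoted above can be reproduced from public architecture numbers. This sketch assumes each decode step at B=1 reads the full FP16 weight set plus the K and V caches once; the layer and parameter counts below are the published configs, not values from the paper.

```python
def kv_fraction(n_layers, n_kv_heads, head_dim, ctx, n_params, kv_dtype_bytes=2):
    """Fraction of per-decode-step traffic that is KV reads (B=1), assuming
    one full FP16 weight read plus one K+V cache read per step."""
    weight_bytes = n_params * 2                                       # FP16 weights
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx * kv_dtype_bytes
    return kv_bytes / (weight_bytes + kv_bytes)

# Public configs: Mistral-7B (32 layers, 8 KV heads, ~7.24B params),
# Qwen2.5-7B (28 layers, 4 KV heads, ~7.61B params); head_dim 128, ctx 8192.
print(f"Mistral-7B: {kv_fraction(32, 8, 128, 8192, 7_240_000_000):.2%}")  # ~6.9%
print(f"Qwen2.5-7B: {kv_fraction(28, 4, 128, 8192, 7_610_000_000):.2%}")  # ~3.0%
```

Both numbers land on the measured 6.90% / 2.99% split, showing the asymmetry is pure architecture arithmetic: fewer KV heads (and fewer layers) mean fewer KV bytes per token.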
Wu et al. 2026 — "DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference" — external KV-cache storage I/O in agentic disaggregated serving; ~95–98% token reuse rates —
arXiv:2602.21548
P4 · KV Precision Asymmetry — Keys and Values Are Not the Same
Direct KV activation quantization experiments reveal a striking asymmetry:
values consistently tolerate INT4 across all tested models.
Keys exhibit model-dependent precision floors. On Qwen2.5-7B, compressing keys to INT4 (with INT8 values)
causes catastrophic collapse — 17,681% PPL increase, 35,000:1 sensitivity asymmetry.
On Mistral-7B, the same configuration causes only +1.23% PPL with 0.98 token agreement.
This is a phase transition, not gradual degradation: the Qwen precision cliff sits at the INT7↔INT6 boundary.
[Interactive tables: K/V bit-width sweeps with ΔPPL, severity, and token agreement — SENSITIVE: Qwen2.5-7B; TOLERANT: Mistral-7B]
Why architectural predictors fail
No architectural feature predicts key quantization sensitivity: GQA ratio (ρ=−0.18), RoPE θ (ρ=0.01),
attention entropy (r=0.29 at 6 models, down from r=0.89 at 3). Fisher information and covariance-based signals
show no significant correlation (Spearman ρ < 0.2, p > 0.3). Sensitivity is a family-specific learned characteristic,
not an architectural property. See Spearman ρ explainer.
2-minute runtime calibration test
Run 5 calibration prompts through INT8 and INT6 key configs (with INT4 values).
Compute mean logit error ratio INT6/INT8. Threshold τ = 3.0 achieves 100% accuracy on 13/13 evaluable models.
Qwen family: ratio 5.07–5.40 (flag as sensitive). All others: ratio < 2.2 (safe for INT4–INT6 keys).
Generalizes to 70B+ scale with zero false positives.
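The decision rule at the heart of the test reduces to a single ratio threshold. The function below is a hypothetical sketch of that rule, not the released ratio_classifier.py; only the τ = 3.0 threshold and the ratio ranges come from the text.

```python
def classify_key_sensitivity(logit_err_int6, logit_err_int8, tau=3.0):
    """Flag a model as key-sensitive if INT6-key logit error blows up vs INT8-key.

    logit_err_*: mean absolute logit error vs the FP16 reference over the
    calibration prompts, with keys at the given width and values at INT4.
    """
    ratio = logit_err_int6 / logit_err_int8
    label = "sensitive" if ratio >= tau else "tolerant"
    return label, ratio

# Ranges reported in the text: Qwen family 5.07-5.40; all others < 2.2.
print(classify_key_sensitivity(5.2, 1.0))  # ('sensitive', 5.2)
print(classify_key_sensitivity(1.8, 1.0))  # ('tolerant', 1.8)
```

The wide gap between the two clusters (5.07+ vs < 2.2) is what makes a fixed threshold robust: any τ between those bands separates all 13 evaluable models.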
Scale attenuation. Qwen key sensitivity attenuates with model scale:
catastrophic at 7B (>10% PPL), +49% at 32B, +1.55% at 72B — below practical threshold.
Larger models distribute quantization error across more layers. K-FP16/V-INT4 remains safe at all Qwen scales (+0.31% at 72B).
Consistent with KIVI (Liu et al. 2024) which observes channel-wise outlier structure differences between keys and values.
Our finding extends from distribution asymmetry to minimum viable bit-width asymmetry.
KIVI: arXiv:2402.02750
P5 · Cross-GPU Validation — Four Matched Lanes
The same qualitative pattern holds across W7900, A100, H100, and B200.
Throughput ordering follows hardware memory-system strength. Latency remains approximately linear in context length.
Cross-platform variation is explained better by sustained decode-bandwidth plateau than by headline peak HBM bandwidth.
B200 has 1.96× the H100 peak bandwidth but decode throughput gain is smaller — practical kernel bandwidth at decode batch sizes
remains well below peak on both architectures.
| GPU | Memory | Peak BW | Tok/s (B=1, T=4096) | Tok/s (B=8, T=4096) | Batch response | Max context (B=1) |
|---|---|---|---|---|---|---|
| AMD W7900 | 48 GB | 864 GB/s | 1.2k | 1.3k | near-linear | 32K |
| NVIDIA A100 | 80 GB | 2,039 GB/s | 12.7k | 15.6k | early saturation | 32K |
| NVIDIA H100 | 80 GB | 3,350 GB/s | 20.1k | 37.1k | saturating | 32K |
| NVIDIA B200 | 178 GB | 6,550 GB/s | 24.7k | 36.5k | saturating | 384K |
B200 extreme context
178 GB HBM enables 384K-token contexts at B=1 (149 GB peak memory, 20.5 ms decode latency) —
a 12× improvement over the 32K practical limit on H100 and W7900.
HBM capacity is the primary constraint for long-context inference.
INT4 KV compression could extend this beyond 512K tokens on the same hardware.
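The capacity arithmetic behind the context-extension claim can be checked with a small helper. The text does not state which model's KV cache hits 149 GB at 384K, so the 32-layer / 8-KV-head / head_dim-128 config below is an illustrative assumption.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx, batch=1, dtype_bytes=2):
    """KV cache footprint in GiB: K and V, [layers x kv_heads x head_dim] per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * batch * dtype_bytes / 2**30

fp16 = kv_cache_gib(32, 8, 128, 384_000)        # illustrative 7-8B-class config
print(f"FP16 KV at 384K: {fp16:.1f} GiB; INT4: {fp16 / 4:.1f} GiB")
```

Whatever the exact config, the scaling is linear in context: INT4 KV shrinks the cache 4×, which is why it could push the same HBM budget well past 512K tokens.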
P6 · Speculation × Quantization — Sub- and Super-Multiplicative Composition
Speculative decoding does not relax bandwidth constraints. The verification step remains a full attention pass over
the entire KV cache. Composition with KV quantization is:
sub-multiplicative (ρ ≈ 0.62) for aggressively grouped models (Qwen, 4 KV heads) where
KV is a small fraction of total bandwidth, and super-multiplicative (ρ up to 1.95) at long context
for models with more KV heads (Llama, 8 KV heads) where KV becomes the bottleneck.
At B=1, acceptance ≥ 0.81 through 8K context. At B=16, acceptance is already below 0.65 at 2K context for two of three models.
Production serving (typically B ≥ 8) operates precisely in the regime where n-gram speculation provides diminishing returns.
Tree verification paradox
Tree speculation helps at low acceptance (+21% at α=0.3) but hurts at high acceptance (−17% at α=0.9).
Use linear speculation at B ≤ 4. Consider tree only at B ≥ 16 with context > 8K where acceptance < 0.5.
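The acceptance numbers above translate into tokens-per-verification via the standard i.i.d. acceptance model for linear speculation (this formula is textbook speculative decoding, not derived in the paper): with draft length k and per-token acceptance α, each verification pass emits (1 − α^(k+1)) / (1 − α) tokens in expectation.

```python
def expected_tokens_per_verify(alpha, k):
    """Expected tokens emitted per verification pass: draft length k,
    i.i.d. per-token acceptance alpha (standard linear-speculation model)."""
    if alpha == 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

# Acceptance collapse with batch (text: >=0.81 at B=1 vs <0.65 at B=16, 2K ctx)
print(f"{expected_tokens_per_verify(0.81, 4):.2f}")  # ~3.43 tokens/pass
print(f"{expected_tokens_per_verify(0.65, 4):.2f}")  # ~2.53 tokens/pass
```

Since every verification pass still pays a full-KV attention read, the drop from ~3.4 to ~2.5 tokens per pass is precisely the diminishing return at production batch sizes.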
P7 · Memory Tiering — No Current Interconnect Is Close
Under synchronous dense attention, KV cache must reside in HBM for real-time decode.
At T=128K, serving overflowed KV from a secondary tier at 10 ms decode latency requires
10–50 TB/s bandwidth depending on model size.
All current interconnects fall orders of magnitude short.
| Interconnect | Peak BW | Shortfall at 128K (7B) | Shortfall at 128K (24B) |
|---|---|---|---|
| HBM3 (H100) | 3,350 GB/s | in-HBM target | in-HBM target |
| NVLink | 900 GB/s | ~11× insufficient | ~55× insufficient |
| PCIe 5 ×16 | 64 GB/s | ~156× insufficient | ~781× insufficient |
| CXL 2.0 | 32 GB/s | ~312× insufficient | ~1,562× insufficient |
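The shortfall factors follow from a one-line division: bytes streamed from the secondary tier per step, over the step latency budget. Taking the text's 7B low end of ~10 TB/s as the required bandwidth reproduces the 7B column.

```python
def shortfall(required_tbs, link_gbs):
    """How many times too slow a link is for the required tier bandwidth."""
    return required_tbs * 1e12 / (link_gbs * 1e9)

REQUIRED_TBS = 10.0  # text's 7B low end at T=128K, 10 ms decode budget
for name, bw_gbs in (("NVLink", 900), ("PCIe 5 x16", 64), ("CXL 2.0", 32)):
    print(f"{name}: ~{shortfall(REQUIRED_TBS, bw_gbs):.0f}x insufficient")
```

The 24B column is the same arithmetic at roughly 5× the required bandwidth, which is why every factor scales up in lockstep.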
Caveat — advanced inference techniques
These projections assume dense attention. Speculative decoding reduces full KV reads per generated token but
does not reduce the instantaneous bandwidth of the verification kernel.
At realistic acceptance rates (k=2–4), required bandwidth at T=128K is still 2.5–25 TB/s,
exceeding NVLink by 3–28×. Sparse attention (MoBA, sliding window) could reduce per-step traffic —
quantifying that interaction remains future work.
Systems like DualPath,
LMCache, and CXL-SpecKV
convert the tiering constraint from instantaneous bandwidth to a scheduling problem via pipelined transfers.
P8 · Future Directions — Reduce KV Access, Not Just Compress It
The paper's closing argument: compression and kernel engineering are the current best path,
but future architectures should address the KV bandwidth bottleneck more directly —
constraining KV access proportional to available memory bandwidth rather than compressing the entire cache.
Two concrete external directions that motivate this:
Per-Tensor Precision Policies Based on Architecture-Specific Floors
The KV precision asymmetry discovery directly motivates AKP: rather than uniform quantization,
select per-tensor precision based on the ratio classifier result.
Combine with attention mechanisms that constrain KV access proportional to available memory bandwidth.
The fused kernel's natural support for asymmetric precision (K and V loaded at independent formats)
provides the implementation substrate.
Open Questions — What Remains Unknown
Qwen root cause: The key fragility is family-specific and attenuates at 72B. Does it disappear entirely at larger scale? What in the training pipeline creates the INT7 floor?
Sparse attention interaction: MoBA and sliding-window attention reduce per-step KV traffic. How does that change the tiering bandwidth requirement and the quantization–speculation composition ratio?
Multi-GPU tensor parallel: All results are single-GPU. Tensor parallelism changes the bandwidth-to-compute ratio and may shift the saturation model parameters.
Practical deployment recipe: For 12 of 14 tested models — uniform INT4 KV quantization with the fused P5 kernel. No per-model tuning. 2.7–4.8× decode speedup on H100.
For Qwen family (7B–32B) — K-FP16/V-INT4 or K-INT8/V-INT4. 37–62% bandwidth reduction preserving quality.
Determine which to use with the 2-minute ratio classifier.
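The full recipe collapses to one branch on the calibration ratio. This is a hypothetical policy sketch of that recipe, not code from the released repository; the τ = 3.0 threshold and the precision choices are from the text.

```python
def kv_precision_policy(calibration_ratio, tau=3.0):
    """Pick per-tensor KV precision from the 2-minute calibration ratio.

    Default: uniform INT4 keys and values with the fused kernel.
    Ratio >= tau (e.g. the Qwen family): keep keys at INT8
    (or FP16 for maximum safety) while values stay INT4.
    """
    if calibration_ratio >= tau:
        return {"keys": "int8", "values": "int4"}
    return {"keys": "int4", "values": "int4"}

print(kv_precision_policy(5.2))  # {'keys': 'int8', 'values': 'int4'}
print(kv_precision_policy(1.8))  # {'keys': 'int4', 'values': 'int4'}
```

The fused kernel's independent K/V load formats are what make this policy free to apply at runtime: no per-model kernel changes, only a format flag.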
Code: ratio_classifier.py · triton_kernels.py