KEY RESULT · H100 · A100 · B200 · W7900 · 14 models, 7 families · fused INT4 · Triton

Memory-Traffic Saturation in
Autoregressive Transformer Decode

Decode is memory-traffic-limited — but the dominant traffic source depends on operating regime. Across 14 open-weight models and four GPU architectures, a batch-driven Hill saturation model (R² = 0.80) captures decode speedup from fused INT4 KV quantization with minimal dependence on model family, GQA ratio, or head dimension. Kernel fusion, not quantization alone, is the mechanism. A 2-minute runtime calibration test identifies models requiring asymmetric key/value precision, revealing a family-specific KV precision asymmetry absent from architectural predictors.

Links: paper-memory-decode repo · fused KV doc · DualPath (arXiv:2602.21548) · MoBA (arXiv:2502.13189)
PREAMBLE — SET THE STAGE FIRST

Three visualizations that build the foundation

Before reading the paper findings, these three interactive explainers establish the structural facts the paper measures empirically. Each one is a prerequisite for understanding why the results below matter.

Part 1 — Structural
Autoregressive Decode Bottleneck
Per-step KV reread pressure in dense causal attention. Total reads = G·C₀ + G(G−1)/2. Shows why the loop is the bottleneck before any hardware is involved.
Part 2 — Empirical
KV-Memory Traffic Governs Decode
Decode scaling across GPU classes. Hill-type saturation law, pipeline comparisons, bandwidth plateau behavior. The empirical counterpart to Part 1's structural argument.
Part 3 — Statistical
Spearman ρ — Rank Correlation
The paper uses Spearman rank correlation to test whether architectural features predict KV quantization sensitivity. They do not: ρ < 0.2, p > 0.3 across 14 models. This explainer covers what that statistic does and does not mean.
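Part 1's reread count follows from summing the per-step attention reads. A minimal sketch (the function name is mine; the closed form is the one quoted above):

```python
def total_kv_reads(c0: int, g: int) -> int:
    """Total KV-cache tokens read while decoding g tokens after a c0-token prompt.

    Step t (t = 0..g-1) runs attention over c0 + t cached tokens, so the sum
    telescopes to the closed form G*C0 + G*(G-1)/2 from Part 1.
    """
    return sum(c0 + t for t in range(g))

# Step-by-step sum agrees with the closed form:
assert total_kv_reads(2048, 512) == 512 * 2048 + 512 * 511 // 2
```

The quadratic G(G−1)/2 term is why long generations, not just long prompts, pressure bandwidth.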
P1 Batch-Driven Saturation — One Number Governs Decode

Across 767 decode-mode configurations with B ≥ 4, a single Hill-type model fitted on batch size alone explains 80% of variance in P5/P0 speedup ratio across 14 architectures on H100. Context length, KV bytes per token, GQA ratio, and head dimension are weak predictors after accounting for batch. Architecture does not fundamentally change the workload — memory-traffic physics does.

S(B) = S_max · B^γ / (B_½^γ + B^γ)   →   S_max = 3.75 · B_½ = 5.1 · γ = 1.32 · R² = 0.80
Implication
80% of asymptotic speedup is reached at B ≥ 16. 95% at B ≥ 64. Batch size is the dominant deployment knob — not architecture, not GQA ratio, not head dimension.
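The fitted curve can be evaluated directly. A minimal sketch using the published parameters (the function name is mine):

```python
def hill_speedup(batch: float, s_max: float = 3.75, b_half: float = 5.1,
                 gamma: float = 1.32) -> float:
    """Predicted P5/P0 decode speedup from the fitted Hill saturation model."""
    bg = batch ** gamma
    return s_max * bg / (b_half ** gamma + bg)

for b in (1, 4, 16, 64, 256):
    s = hill_speedup(b)
    print(f"B={b:3d}  speedup={s:.2f}x  ({100 * s / 3.75:.0f}% of max)")
```

Running the loop reproduces the implication above: B = 16 lands at roughly 82% of S_max, B = 64 at roughly 97%.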
Wang et al. 2025 — "A Systematic Characterization of LLM Inference on GPUs" confirms decode exhibits substantially higher memory bandwidth utilization than prefill — arXiv:2512.01644
P2 Fusion Is the Mechanism — Not Quantization Itself

Six pipelines were developed. The critical comparison is P0 (FP16 baseline) vs P1 (non-fused INT4) vs P5 (fused INT4). A non-fused pipeline reads INT4, dequantizes into an intermediate FP16 buffer, writes that buffer back, then calls attention. The extra write largely negates the savings from reading smaller values. The fused kernel dequantizes inside the attention loop — unpacked values live only in registers, never touching global memory. This is analogous to FlashAttention's tile-and-discard strategy.

P0 — baseline FP16 KV cache · PyTorch SDPA · no quantization 1.0×
P1 — non-fused INT4 KV → dequant → FP16 buffer → SDPA. The buffer write negates savings. 0.5× (slowdown)
P3 — fused · hardware-aware INT4 Triton kernel · RDNA3 wavefront tiling (BLOCK_N=128) 2.7–7.2×
P5 — unified · production kernel · register tiling · warp-level reductions · unified decode+prefill paths 2.7–4.8× (H100) · 1.6–7.2× (W7900)
Critical Systems Insight
Quantization without fusion is counterproductive. Any deployment that dequantizes KV cache into a temporary buffer before calling attention will see no benefit and likely a regression — validated on both NVIDIA Hopper and AMD RDNA 3. The speedup comes entirely from eliminating intermediate memory traffic.
Why speedup < 4×. INT4 compresses KV reads by 4×, but not all decode traffic is KV. Let f be the fraction of total traffic from KV reads:
Speedup ≈ 1 / (f/4 + (1−f)) = 1 / (1 − 3f/4)    At f ≈ 0.96 → 3.6×; the fitted S_max ≈ 3.75 corresponds to f ≈ 0.98
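The bound is a two-term Amdahl argument and can be checked numerically. A minimal sketch (the f values swept are illustrative):

```python
def compression_speedup(f: float, ratio: float = 4.0) -> float:
    """Speedup when only the KV fraction f of read traffic is compressed by `ratio`.

    The remaining (1 - f) of traffic (weights, activations) is untouched,
    so speedup saturates at `ratio` only as f -> 1.
    """
    return 1.0 / (f / ratio + (1.0 - f))

for f in (0.50, 0.85, 0.96, 0.98):
    print(f"f={f:.2f}  speedup={compression_speedup(f):.2f}x")
```

The sweep makes the saturation visible: even at f = 0.85 the model caps out well below the nominal 4× compression ratio.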
Full four-pipeline validation on AMD W7900 (RDNA 3) via Triton ROCm backend — fused kernel achieves 1.6–7.2× with full numerical fidelity (max absolute error < 4.9×10⁻⁴). P1 slowdown confirmed cross-vendor.
P3 Traffic Regime: Weight-Bound → KV-Bound Transition

The widely cited claim that decode is "KV-bandwidth bound" is incomplete at 7B scale. Direct bandwidth decomposition of the verification step on A100 shows that at 7B with 2K context, model weight reads constitute 94–99% of per-step bandwidth while KV reads are under 2%. The KV-bound regime emerges only at longer contexts, larger batch sizes, or larger models.

Weight-bound regime — model weights dominate (94–99% of per-step reads). Small model, short context. INT4 KV → modest benefit.
KV-bound regime — KV cache grows to 50%+ of per-step reads. Long context · large batch · large model. Fused INT4 → full speedup.
The weight-bound → KV-bound transition is driven by increasing batch size, context length, and model scale.
GQA crossover point
Qwen2.5-7B uses only 4 KV heads vs 8 for Mistral/Llama, halving its KV traffic per step. At 8K context, Qwen's KV fraction (2.99%) is less than half of Mistral's (6.90%). The crossover to KV-dominated bandwidth occurs at shorter contexts for models with more KV heads.
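The crossover arithmetic can be sketched from the published model configs. The function and the B=1/FP16 read-accounting below are mine; the configs (Qwen2.5-7B: 28 layers, 4 KV heads, 7.61B params; Mistral-7B: 32 layers, 8 KV heads, 7.24B params; head_dim 128 for both) are the standard published values:

```python
def kv_fraction(n_layers: int, n_kv_heads: int, head_dim: int,
                n_params: float, ctx: int, batch: int = 1,
                dtype_bytes: int = 2) -> float:
    """Fraction of per-step decode read traffic contributed by KV-cache reads."""
    # K and V caches are both read in full at every decode step.
    kv = batch * ctx * 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    weights = n_params * dtype_bytes  # weights are read once per step at any batch
    return kv / (kv + weights)

qwen = kv_fraction(28, 4, 128, 7.61e9, ctx=8192)     # Qwen2.5-7B: 4 KV heads
mistral = kv_fraction(32, 8, 128, 7.24e9, ctx=8192)  # Mistral-7B: 8 KV heads
print(f"Qwen {qwen:.2%}   Mistral {mistral:.2%}")
```

This simple accounting reproduces the quoted 8K-context fractions (≈2.99% vs ≈6.90%) to two decimal places.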
Wu et al. 2026 — "DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference" — external KV-cache storage I/O in agentic disaggregated serving; ~95–98% token reuse rates — arXiv:2602.21548
P4 KV Precision Asymmetry — Keys and Values Are Not the Same

Direct KV activation quantization experiments reveal a striking asymmetry: values consistently tolerate INT4 across all tested models, while keys exhibit model-dependent precision floors. On Qwen2.5-7B, compressing keys to INT4 (with INT8 values) causes catastrophic collapse — a 17,681% PPL increase and a 35,000:1 sensitivity asymmetry. On Mistral-7B, the same configuration causes only +1.23% PPL with 0.98 token agreement. This is a phase transition, not gradual degradation: the Qwen precision cliff sits at the INT7↔INT6 boundary.

SENSITIVE — Qwen2.5-7B (per-config table: K/V bits · ΔPPL · severity · token agreement)

TOLERANT — Mistral-7B (same columns)
Why architectural predictors fail
No architectural feature predicts key quantization sensitivity: GQA ratio (ρ=−0.18), RoPE θ (ρ=0.01), attention entropy (r=0.29 at 6 models, down from r=0.89 at 3). Fisher information and covariance-based signals show no significant correlation (Spearman ρ < 0.2, p > 0.3). Sensitivity is a family-specific learned characteristic, not an architectural property. See Spearman ρ explainer.
2-minute runtime calibration test
Run 5 calibration prompts through INT8 and INT6 key configs (with INT4 values). Compute mean logit error ratio INT6/INT8. Threshold τ = 3.0 achieves 100% accuracy on 13/13 evaluable models. Qwen family: ratio 5.07–5.40 (flag as sensitive). All others: ratio < 2.2 (safe for INT4–INT6 keys). Generalizes to 70B+ scale with zero false positives.
Scale attenuation. Qwen key sensitivity attenuates with model scale: catastrophic at 7B (>10% PPL), +49% at 32B, +1.55% at 72B — below practical threshold. Larger models distribute quantization error across more layers. K=FP16 / V=INT4 remains safe at all Qwen scales (+0.31% at 72B).
Consistent with KIVI (Liu et al. 2024) which observes channel-wise outlier structure differences between keys and values. Our finding extends from distribution asymmetry to minimum viable bit-width asymmetry. KIVI: arXiv:2402.02750
P5 Cross-GPU Validation — Four Matched Lanes

The same qualitative pattern holds across W7900, A100, H100, and B200. Throughput ordering follows hardware memory-system strength. Latency remains approximately linear in context length. Cross-platform variation is explained better by sustained decode-bandwidth plateau than by headline peak HBM bandwidth. B200 has 1.96× the H100 peak bandwidth but decode throughput gain is smaller — practical kernel bandwidth at decode batch sizes remains well below peak on both architectures.

| GPU | Memory | Peak BW | Tok/s (B=1, T=4096) | Tok/s (B=8, T=4096) | Batch response | Max context (B=1) |
|---|---|---|---|---|---|---|
| AMD W7900 | 48 GB | 864 GB/s | 1.2k | 1.3k | near-linear | 32K |
| NVIDIA A100 | 80 GB | 2039 GB/s | 12.7k | 15.6k | early saturation | 32K |
| NVIDIA H100 | 80 GB | 3350 GB/s | 20.1k | 37.1k | saturating | 32K |
| NVIDIA B200 | 178 GB | 6550 GB/s | 24.7k | 36.5k | saturating | 384K |
B200 extreme context
178 GB HBM enables 384K-token contexts at B=1 (149 GB peak memory, 20.5 ms decode latency) — a 12× improvement over the 32K practical limit on H100 and W7900. HBM capacity is the primary constraint for long-context inference. INT4 KV compression could extend this beyond 512K tokens on the same hardware.
P6 Speculation × Quantization — Sub- and Super-Multiplicative Composition

Speculative decoding does not relax bandwidth constraints. The verification step remains a full attention pass over the entire KV cache. Composition with KV quantization is: sub-multiplicative (ρ ≈ 0.62) for aggressively grouped models (Qwen, 4 KV heads) where KV is a small fraction of total bandwidth, and super-multiplicative (ρ up to 1.95) at long context for models with more KV heads (Llama, 8 KV heads) where KV becomes the bottleneck.

ρ = S_combined / (S_quant × S_spec)    ρ < 1.0 sub-multiplicative · ρ = 1.0 multiplicative · ρ > 1.0 super-multiplicative
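The ratio is computed directly from three measured speedups. A minimal sketch (the individual speedup values below are hypothetical, chosen only to land on the two ρ endpoints reported in the text):

```python
def composition_rho(s_combined: float, s_quant: float, s_spec: float) -> float:
    """Composition ratio: <1 sub-multiplicative, =1 multiplicative, >1 super-multiplicative."""
    return s_combined / (s_quant * s_spec)

# Hypothetical measurements illustrating both reported regimes:
print(composition_rho(3.10, 2.5, 2.0))  # 0.62 -> sub-multiplicative (Qwen-like, 4 KV heads)
print(composition_rho(9.75, 2.5, 2.0))  # 1.95 -> super-multiplicative (Llama-like, long context)
```

ρ < 1 means the two optimizations partly target the same bandwidth; ρ > 1 means quantization relieves a bottleneck that was also throttling verification.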
Batch kills speculation faster than context
At B=1, acceptance ≥ 0.81 through 8K context. At B=16, acceptance is already below 0.65 at 2K context for two of three models. Production serving (typically B ≥ 8) operates precisely in the regime where n-gram speculation provides diminishing returns.
Tree verification paradox
Tree speculation helps at low acceptance (+21% at α=0.3) but hurts at high acceptance (−17% at α=0.9). Use linear speculation at B ≤ 4. Consider tree only at B ≥ 16 with context > 8K where acceptance < 0.5.
P7 Memory Tiering — No Current Interconnect Is Close

Under synchronous dense attention, KV cache must reside in HBM for real-time decode. At T=128K, serving overflowed KV from a secondary tier at 10 ms decode latency requires 10–50 TB/s bandwidth depending on model size. All current interconnects fall orders of magnitude short.

| Interconnect | Peak BW | Shortfall at 128K (7B) | Shortfall at 128K (24B) |
|---|---|---|---|
| HBM3 (H100) | 3,350 GB/s | in-HBM target | in-HBM target |
| NVLink | 900 GB/s | ~11× insufficient | ~55× insufficient |
| PCIe 5 ×16 | 64 GB/s | ~156× insufficient | ~781× insufficient |
| CXL 2.0 | 32 GB/s | ~312× insufficient | ~1,562× insufficient |
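The shortfall column is simply required over available bandwidth. A minimal check (the helper is mine; the 10 and 50 TB/s requirements are the projection quoted above for 7B and 24B at T=128K):

```python
def shortfall(required_tb_s: float, peak_gb_s: float) -> float:
    """How many times a required bandwidth exceeds an interconnect's peak."""
    return required_tb_s * 1000.0 / peak_gb_s

print(f"NVLink, 7B @ 128K:   {shortfall(10, 900):.0f}x short")
print(f"PCIe 5, 24B @ 128K:  {shortfall(50, 64):.0f}x short")
```

Reproducing two table entries (~11× and ~781×) confirms the column is a straight ratio against peak, with no pipelining credit.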
Caveat — advanced inference techniques
These projections assume dense attention. Speculative decoding reduces full KV reads per generated token but does not reduce the instantaneous bandwidth of the verification kernel. At realistic draft lengths (k = 2–4) and acceptance rates, required bandwidth at T=128K is still 2.5–25 TB/s, exceeding NVLink by 3–28×. Sparse attention (MoBA, sliding window) could reduce per-step traffic — quantifying that interaction remains future work.

Systems like DualPath, LMCache, and CXL-SpecKV convert the tiering constraint from instantaneous bandwidth to a scheduling problem via pipelined transfers.
P8 Future Directions — Reduce KV Access, Not Just Compress It

The paper's closing argument: compression and kernel engineering are the current best path, but future architectures should address the KV bandwidth bottleneck more directly — constraining KV access proportional to available memory bandwidth rather than compressing the entire cache. Two concrete external directions that motivate this:

MoBA — arXiv:2502.13189
Mixture of Block Attention for Long-Context LLMs
Applies MoE routing to the attention mechanism. Each query attends to a small subset of KV blocks, reducing complexity toward near-linear. Allows seamless transition between full and sparse attention. Deployed in Kimi's long-context serving. Directly reduces per-step KV reads — the architectural-level complement to kernel-level quantization.

Connection: MoBA structurally reduces per-step KV reads, shrinking the KV traffic fraction f itself. Under the paper's speedup model the two approaches compose: MoBA cuts how many KV blocks are read, while fused quantization compresses the reads that remain — attacking the same f/4 traffic term from both sides.
DualPath — arXiv:2602.21548
Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
Addresses external KV-cache storage I/O in agentic disaggregated serving. ~95–98% token reuse rates in multi-turn workloads means warm caches, but per-step KV reads inside the attention kernel are unchanged. DualPath provides the system-level scheduling complement to this paper's kernel-level result.

Connection: This paper establishes the instantaneous bandwidth requirement that DualPath-style systems must route around. The tiering projections (P7) are the quantitative input to that scheduling problem.
AKP — Adaptive KV Precision
Per-Tensor Precision Policies Based on Architecture-Specific Floors
The KV precision asymmetry discovery directly motivates AKP: rather than uniform quantization, select per-tensor precision based on the ratio classifier result. Combine with attention mechanisms that constrain KV access proportional to available memory bandwidth. The fused kernel's natural support for asymmetric precision (K and V loaded at independent formats) provides the implementation substrate.
Open Questions
What Remains Unknown
Qwen root cause: The key fragility is family-specific and attenuates at 72B. Does it disappear entirely at larger scale? What in the training pipeline creates the INT7 floor?

Sparse attention interaction: MoBA and sliding-window attention reduce per-step KV traffic. How does that change the tiering bandwidth requirement and the quantization–speculation composition ratio?

Multi-GPU tensor parallel: All results are single-GPU. Tensor parallelism changes the bandwidth-to-compute ratio and may shift the saturation model parameters.
Practical deployment recipe: For 12 of 14 tested models — uniform INT4 KV quantization with the fused P5 kernel. No per-model tuning. 2.7–4.8× decode speedup on H100. For the Qwen family (7B–32B) — K=FP16/V=INT4 or K=INT8/V=INT4. 37–62% bandwidth reduction preserving quality. Determine which to use with the 2-minute ratio classifier.
Code: ratio_classifier.py · triton_kernels.py