Asymmetric K16/V8 KV cache quantization via modified vLLM + FlashInfer matches FP16 perplexity on Qwen2.5-7B at both 2K and 8K context and recovers GSM8K to within 0.5 pp (90.0% vs 90.5% FP16, n=200, 8-shot), while symmetric FP8 collapses (PPL 214→1058, GSM8K 2.0%). The asymmetric path runs end-to-end through the modified serving stack: prefill-time V quantization, paged-cache writes, and asymmetric FlashInfer prefill/decode kernels. At the standalone-kernel level on the headline Qwen2.5-7B lane, K16/V8 matches FP16 within measurement noise and is 33% faster than symmetric FP8 by avoiding the serial K-dequant path. Behind the practical result: throughput scaling is governed primarily by memory traffic, while quantization safety remains model-family dependent. K16/V8 means native-16-bit keys (FP16 or BF16 depending on model dtype) plus FP8 e4m3 values; the Qwen2.5-7B vLLM runs use BF16 K + FP8 V.
Before reading the paper findings, these three interactive explainers establish the structural facts the paper measures empirically. Each one is a prerequisite for understanding why the results below matter.
The asymmetric K16/V8 result now runs end-to-end through the modified vLLM + FlashInfer serving stack with explicit FlashInfer backend selection. The asymmetric path exercises prefill-time V quantization, paged-cache writes, and asymmetric FlashInfer prefill/decode kernels. The table reports controlled FP16 / FP8-sym / K16/V8 measurements on the same H100 path: K16/V8 matches FP16 perplexity to reported precision at both 2K and 8K context and recovers GSM8K to within 0.5 percentage points, while symmetric FP8 collapses on Qwen.
| Config | PPL@2K | PPL@8K | GSM8K (n=200) | Smoke tok/s |
|---|---|---|---|---|
| FP16 baseline | 6.997 | 5.243 | 90.5% | 102.6 |
| FP8 symmetric | 214.3 | 1058.0 | 2.0% | 97.9 |
| Asymmetric K16/V8 | 6.997 | 5.243 | 90.0% | 105.8 |
Symmetric FP8 PPL gets roughly 5× worse from 2K to 8K (214→1058): the K-fragility phenomenon strengthens
at longer context. Asymmetric K16/V8 matches FP16 PPL to the reported precision at both context lengths.
This is the final serving-stack validation; the earlier HuggingFace DynamicCache simulation in
Section V is retained only as a precision-asymmetry cross-check.
| Gate | Result |
|---|---|
| FlashInfer decode, BF16-K / FP8-V | rel err 0.0254 vs BF16 ref |
| FlashInfer prefill, BF16-K / FP8-V | rel err 0.0255 vs BF16 ref |
| vLLM cache writer | K bit-exact, V within FP8 noise |
| K/V aliasing check | disjoint storage |
| Required backend | FlashInfer via attention_config |
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    dtype="bfloat16",
    kv_cache_dtype=("auto", "fp8_e4m3"),  # K native, V FP8
    attention_config={"backend": "FLASHINFER"},  # required: auto picks FlashAttn
)
The VLLM_ATTENTION_BACKEND environment variable is not honored in this vLLM build —
pass attention_config={"backend": "FLASHINFER"} explicitly. Auto-selection picks FlashAttention,
which lacks the asymmetric tuple writer.
Storing keys at native 16-bit precision and only quantizing values to FP8 keeps K dequantization off the softmax serial critical path (it never happens) while V dequantization overlaps with the reduction pipeline. At the standalone FlashInfer kernel level on the headline Qwen2.5-7B lane, asymmetric K16/V8 matches FP16 within measurement noise and is 33% faster than symmetric FP8 by avoiding the serial K-dequant path. Across the three-model vLLM batch-32 throughput comparison (Qwen2.5-7B, Llama-3.1-8B, Marin-8B), asymmetric is 4.8–9.6% faster than symmetric FP8 while providing 1.33× KV cache capacity. Implementation requires modified vLLM/FlashInfer kernels but no model changes, retraining, or calibration.
FlashInfer branch: asym-prefill-refactor-stage
(K16/V8 production path with the FI-1..FI-5 CUDA template refactor for independent K/V dtypes in
prefill+decode). The vLLM-side plumbing is on the
asymmetric-kv-plumbing branch.
It integrates with vLLM via LLM(kv_cache_dtype=("auto", "fp8_e4m3"),
attention_config={"backend": "FLASHINFER"}).
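The 1.33× capacity figure quoted above follows from plain bit accounting, which the following minimal sketch makes explicit. It assumes capacity is measured as tokens per fixed KV byte budget relative to a 16-bit K / 16-bit V cache, and it ignores per-tensor scale metadata and page-table overhead; the function name is illustrative and not part of vLLM or FlashInfer.

```python
def capacity_gain(k_bits: int, v_bits: int, baseline_bits: int = 16) -> float:
    """Tokens that fit in a fixed KV budget, relative to a K16/V16 cache.

    Assumption: K and V have the same element count per token, so the
    per-token footprint is proportional to k_bits + v_bits.
    """
    return (2 * baseline_bits) / (k_bits + v_bits)


fp16 = capacity_gain(16, 16)    # baseline: K16/V16
sym_fp8 = capacity_gain(8, 8)   # symmetric FP8: K8/V8
asym = capacity_gain(16, 8)     # asymmetric: K16/V8

for name, gain in [("FP16", fp16), ("FP8-sym", sym_fp8), ("K16/V8", asym)]:
    print(f"{name:8s} {gain:.2f}x tokens per byte of KV cache")
```

Under these assumptions the asymmetric layout gives up a third of the capacity gain that symmetric FP8 offers in exchange for leaving keys untouched.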
Across 767 decode-mode configurations with B ≥ 4, a single Hill-type model fitted on batch size alone explains 80% of variance in fused-KV speedup ratio across 14 architectures on H100. Context length, KV bytes per token, GQA ratio, and head dimension are weak predictors after accounting for batch. Architecture does not fundamentally change the workload — memory-traffic physics does.
Data provenance: the saturation coefficients below were fit to the fused INT4 Triton sweep — the original R&D that established the memory-traffic-bound hypothesis across 14 models before the FlashInfer production path existed. The physics claim (decode is governed by batch-driven memory-traffic saturation) carries through to FlashInfer FP8; the specific Smax depends on compression ratio and kernel overhead. See P2 for per-model FlashInfer throughput.
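As a rough illustration of what a batch-only Hill-type saturation fit looks like, the sketch below fits a three-parameter curve by grid search and reports the explained variance. The functional form, the parameter names (`s_max`, `b_half`, `n`), the grid ranges, and the data points are all assumptions made for illustration; they are not the paper's parameterization or measurements.

```python
def hill(batch, s_max, b_half, n):
    """Assumed Hill-type form: speedup rises from 1 toward s_max as batch grows."""
    x = batch ** n
    return 1.0 + (s_max - 1.0) * x / (b_half ** n + x)


def r_squared(ys, preds):
    """Fraction of variance in ys explained by preds."""
    mean = sum(ys) / len(ys)
    ss_tot = sum((y - mean) ** 2 for y in ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    return 1.0 - ss_res / ss_tot


def fit_hill(batches, speedups):
    """Least-squares fit over a coarse parameter grid (stdlib only)."""
    best = None
    for s_max in [1.0 + 0.1 * i for i in range(1, 31)]:
        for b_half in [2.0 * j for j in range(1, 33)]:
            for n in (0.5, 1.0, 1.5, 2.0):
                preds = [hill(b, s_max, b_half, n) for b in batches]
                sse = sum((y - p) ** 2 for y, p in zip(speedups, preds))
                if best is None or sse < best[0]:
                    best = (sse, s_max, b_half, n)
    return best[1:]


# Synthetic, noise-free points generated from the assumed form (not paper data).
batches = [4, 8, 16, 32, 64, 128]
speedups = [hill(b, 2.5, 16.0, 1.0) for b in batches]

s_max, b_half, n = fit_hill(batches, speedups)
r2 = r_squared(speedups, [hill(b, s_max, b_half, n) for b in batches])
print(f"s_max={s_max:.2f} b_half={b_half:.1f} n={n:.1f} R^2={r2:.4f}")
```

On real sweep data a proper nonlinear least-squares routine would replace the grid search; the point here is only that batch size is the sole input to the model.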
vLLM 0.19 + FlashInfer decode throughput measured at batch 32 on H100 SXM5 across eight open-weight models spanning three architectures. The comparison is three-way: FP16 baseline, symmetric FP8, and our asymmetric K16/V8 branch. Asymmetric reaches 1.38× FP16 on Qwen2.5-7B — the largest end-to-end win — and is within a few percent of FP16 on the remaining models, with some small or integration-sensitive cases trailing FP16 (Qwen3-8B at 0.94×, DS-R1-7B at 0.96×, Phi-4-14B at 0.99×). Small-model throughput (0.5–1.5B) sits slightly under FP16 because at that scale per-step bandwidth is dominated by model weights, not KV cache — KV compression contributes less of the total savings. Symmetric FP8 falls below asymmetric on every model because K dequantization under symmetric FP8 lies on the softmax serial critical path, whereas asymmetric keeps K at native precision and only pipelines V dequant.
/data/knlp-key-results/h100-flashinfer/h100_throughput_sweep.json.
Measurements at batch 32 through vLLM 0.19 + FlashInfer 0.6.7 with our K16/V8 branch applied
(asym-prefill-refactor-stage).
Asymmetric K16/V8 requires the small vLLM/FlashInfer dtype-split extension described in the paper —
no model changes, no calibration. Symmetric FP8 maps onto existing FlashInfer paged-cache infrastructure
without modification.
The asymmetric FlashInfer result above rests on a structural claim: kernel fusion — dequantizing KV inside the attention loop instead of materializing an intermediate FP16 buffer — is what turns KV compression into real decode throughput. That claim was first established by a custom fused INT4 Triton kernel before FlashInfer's production paged-cache infrastructure existed. It remains on the repository as a controlled experiment and is now demoted to R&D/appendix status; the paper's main result is FlashInfer (P0, P2 above).
The widely cited claim that decode is "KV-bandwidth bound" is incomplete at 7B scale. Direct bandwidth decomposition of the verification step on A100 shows that at 7B with 2K context, model weight reads constitute 94–99% of per-step bandwidth while KV reads are under 2%. The KV-bound regime emerges only at longer contexts, larger batch sizes, or larger models.
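A back-of-envelope version of this decomposition can be sketched as follows. It is a simplified model, not the paper's measurement method: it assumes weights are read once per decode step, KV is read once per sequence per step, everything is stored at 2 bytes per element, and activations are ignored. The model shape (28 layers, 4 KV heads, head dimension 128, about 7.6B parameters) is an assumed Qwen2.5-7B-like configuration.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of K plus V stored per token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_fraction(params, layers, kv_heads, head_dim, context,
                batch=1, bytes_per_elem=2):
    """Share of per-step memory reads attributable to the KV cache."""
    weight_bytes = params * bytes_per_elem          # read once per step
    kv_bytes = batch * context * kv_bytes_per_token(
        layers, kv_heads, head_dim, bytes_per_elem  # read once per sequence
    )
    return kv_bytes / (weight_bytes + kv_bytes)


# Assumed Qwen2.5-7B-like shape; values are illustrative.
SHAPE = dict(params=7.6e9, layers=28, kv_heads=4, head_dim=128)

frac_2k = kv_fraction(context=2048, **SHAPE)
frac_128k_b8 = kv_fraction(context=131072, batch=8, **SHAPE)

print(f"KV share at 2K, B=1:   {frac_2k:.2%}")
print(f"KV share at 128K, B=8: {frac_128k_b8:.2%}")
```

In this simplified model the KV share stays below 2% at 2K context with a single sequence and only dominates once context and batch grow together, which is the regime shift the paragraph describes.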
Direct KV activation quantization experiments reveal a striking asymmetry: values consistently tolerate INT4 across all tested models, while keys exhibit model-dependent precision floors. On Qwen2.5-7B, compressing keys to INT4 (with INT8 values) causes catastrophic collapse: a 17,681% PPL increase and a 35,000:1 sensitivity asymmetry. On Mistral-7B, the same configuration causes only +1.23% PPL with 0.98 token agreement. This is a phase transition, not gradual degradation: the Qwen precision cliff is at the INT7↔INT6 boundary.
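One way to build intuition for a precision floor is a toy per-tensor integer quantizer applied to a tensor whose bulk values are small but which contains a few large outliers. The sketch below is illustrative only: the data is synthetic, the symmetric round-to-nearest quantizer is a textbook scheme rather than the paper's kernel, and the outlier magnitudes are invented.

```python
import math


def quantize(values, bits):
    """Symmetric per-tensor quantization: one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def bulk_error(values, bits, n_bulk):
    """Mean absolute reconstruction error over the first n_bulk entries."""
    deq = quantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values[:n_bulk], deq[:n_bulk])) / n_bulk


# Synthetic data: 256 bulk values in [-1, 1], plus two invented outliers.
bulk = [math.sin(i) for i in range(256)]
with_outliers = bulk + [200.0, -180.0]

err_clean = bulk_error(bulk, 8, 256)
err_outlier = bulk_error(with_outliers, 8, 256)

print(f"INT8 bulk error without outliers: {err_clean:.4f}")
print(f"INT8 bulk error with outliers:    {err_outlier:.4f}")
```

Because the scale is set by the largest magnitude, two outliers are enough to push almost every bulk value onto one of three grid points, so the bulk error grows by orders of magnitude even at 8 bits.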
Protocol: llm-compressor (the standard NVIDIA-aligned KV-quant
calibration tool), a 512-sample WikiText-2 calibration set at T=2048, and per-tensor static FP8 e4m3 scales loaded
by vLLM. On Qwen2.5-7B the result is worse than uncalibrated symmetric FP8: WikiText-2
perplexity jumps to 9.6×10⁵ with GSM8K accuracy at 0%, vs uncalibrated FP8's
PPL ≈ 157 / GSM8K 0.5%. The mechanism is the dual of dynamic-calibration failure: a per-tensor scale wide
enough to absorb Qwen's K outliers leaves typical channel values rounded to a coarser grid than the
unconditional default, destroying bulk precision in exchange for retaining a handful of outlier samples.
Asymmetric K16/V8 under the identical protocol yields PPL 7.50 matching FP16 to four decimals and GSM8K
84.5%, an 84.5-point absolute improvement over static-calibrated FP8. Calibration cannot solve a
precision-floor problem; what it can do is shift which part of the distribution gets sacrificed.
Archive at qwen-fragility-bundled-20260425.
Qwen3_5ForConditionalGeneration is a hybrid architecture where every fourth layer
carries a true KV cache (the rest are linear attention). At T=2048: FP16 PPL 7.358; FP8-sym PPL 7.347,
but GSM8K drops 10.5 absolute points (35.5% → 25.0%, n=200, 8-shot); asymmetric K16/V8 PPL 7.358 with
GSM8K 35.5%, sample-by-sample identical to FP16. Asymmetric is also fastest at evaluation time
(24 s for the WikiText-2 sweep vs 79 s FP16, 42 s FP8-sym) because the smaller V cache enables more concurrency.
The K-dequant throughput penalty appears across batch and context too: at T=16,384, sym FP8 is 0.78×–0.80×
of FP16 throughput across B ∈ [4, 32]. Even on a model where 75% of layers do not use the standard KV cache
at all, asymmetric delivers FP16-equivalent quality and a real throughput improvement.
The same qualitative pattern holds across AMD W7900 (RDNA 3), NVIDIA A100 (Ampere), NVIDIA H100 (Hopper), NVIDIA B200 (Blackwell), and AMD MI300X (CDNA 3). Throughput ordering follows hardware memory-system strength. Latency remains approximately linear in context length. Cross-platform variation is explained better by sustained decode-bandwidth plateau than by headline peak HBM bandwidth. B200 has 1.96× the H100 peak bandwidth, but its decode throughput gain is smaller: practical kernel bandwidth at decode batch sizes remains well below peak on both architectures. On the AMD side, MI300X's CDNA 3 MFMA FP8 path eliminates the K-dequant serial penalty that Hopper's Tensor Core ISA imposes, so symmetric FP8 carries a much smaller throughput cost on MI300X (1.7–2.5%) than on H100 (3–8%). The asymmetric-vs-symmetric ordering therefore differs across vendors for structural reasons.
| GPU | Memory | Peak BW | Tok/s (B=1, T=4096) | Tok/s (B=8, T=4096) | Batch response | Max context (B=1) |
|---|---|---|---|---|---|---|
| AMD W7900 | 48 GB | 864 GB/s | 1.2k | 1.3k | near-linear | 32K |
| NVIDIA A100 | 80 GB | 2039 GB/s | 12.7k | 15.6k | early saturation | 32K |
| NVIDIA H100 | 80 GB | 3350 GB/s | 20.1k | 37.1k | saturating | 32K |
| NVIDIA B200 | 178 GB | 6550 GB/s | 24.7k | 36.5k | saturating | 384K |
| AMD MI300X | 192 GB | 5300 GB/s | — | — | saturating | 32K+ |
Speculative decoding does not relax bandwidth constraints. The verification step remains a full attention pass over the entire KV cache. Composition with KV quantization is: sub-multiplicative (ρ ≈ 0.62) for aggressively grouped models (Qwen, 4 KV heads) where KV is a small fraction of total bandwidth, and super-multiplicative (ρ up to 1.95) at long context for models with more KV heads (Llama, 8 KV heads) where KV becomes the bottleneck.
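The text does not define ρ, so the sketch below assumes the most natural reading: ρ is the measured joint speedup divided by the product of the two individual speedups, with ρ < 1 meaning sub-multiplicative and ρ > 1 super-multiplicative. The function names, the tolerance, and the example speedups are hypothetical and chosen only to land near the quoted ρ values.

```python
def composition_ratio(joint_speedup, spec_speedup, quant_speedup):
    """Assumed definition: rho = joint / (speculative * quantization)."""
    return joint_speedup / (spec_speedup * quant_speedup)


def classify(rho, tol=0.05):
    """Label the composition regime relative to perfect multiplicativity."""
    if rho < 1.0 - tol:
        return "sub-multiplicative"
    if rho > 1.0 + tol:
        return "super-multiplicative"
    return "multiplicative"


# Hypothetical speedups over a plain FP16 baseline (not measured values).
rho_small_kv = composition_ratio(joint_speedup=1.49, spec_speedup=2.0, quant_speedup=1.2)
rho_large_kv = composition_ratio(joint_speedup=3.51, spec_speedup=1.5, quant_speedup=1.2)

print(f"rho={rho_small_kv:.2f} -> {classify(rho_small_kv)}")
print(f"rho={rho_large_kv:.2f} -> {classify(rho_large_kv)}")
```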
Under synchronous dense attention, KV cache must reside in HBM for real-time decode. At T=128K, serving overflowed KV from a secondary tier at 10 ms decode latency requires 10–50 TB/s bandwidth depending on model size. All current interconnects fall orders of magnitude short.
| Interconnect | Peak BW | Shortfall at 128K (7B) | Shortfall at 128K (24B) |
|---|---|---|---|
| HBM3 (H100) | 3,350 GB/s | in-HBM target | in-HBM target |
| NVLink | 900 GB/s | ~11× insufficient | ~55× insufficient |
| PCIe 5 ×16 | 64 GB/s | ~156× insufficient | ~781× insufficient |
| CXL 2.0 | 32 GB/s | ~312× insufficient | ~1,562× insufficient |
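The shortfall factors in the table are simply required bandwidth divided by link bandwidth. The sketch below reproduces them under the assumption that the two ends of the quoted 10–50 TB/s range correspond to the 7B and 24B columns respectively; that mapping is inferred from the table ratios rather than stated in the text.

```python
# Assumed requirement per model size, in GB/s (ends of the quoted 10-50 TB/s range).
REQUIRED_GBPS = {"7B": 10_000, "24B": 50_000}

# Peak link bandwidths from the table, in GB/s.
LINKS_GBPS = {"NVLink": 900, "PCIe 5 x16": 64, "CXL 2.0": 32}


def shortfall(required_gbps, available_gbps):
    """How many times more bandwidth is needed than the link provides."""
    return required_gbps / available_gbps


table = {
    link: {model: shortfall(req, bw) for model, req in REQUIRED_GBPS.items()}
    for link, bw in LINKS_GBPS.items()
}

for link, row in table.items():
    print(f"{link:11s} 7B: ~{row['7B']:.0f}x   24B: ~{row['24B']:.0f}x")
```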
The paper's closing argument: compression and kernel engineering are the current best path, but future architectures should address the KV bandwidth bottleneck more directly — constraining KV access proportional to available memory bandwidth rather than compressing the entire cache. Two concrete external directions that motivate this:
The full-stack vLLM + FlashInfer asymmetric K16/V8 result depends on three modified serving-stack repos. All three are public on GitHub with the exact branches the paper measurements were taken on.
| Repo | Branch | What it contains |
|---|---|---|
| mcgrof/vllm | asymmetric-kv-plumbing | Tuple K/V cache, FlashAttn writer patch, asym dtype plumbing |
| mcgrof/flashinfer | asym-prefill-refactor-stage | FI-1..FI-5 CUDA template refactor for independent K/V dtypes in prefill+decode |
| mcgrof/LMCache | asymmetric-kv-codec | K16/V8 codec, split-tier placement, serde, 74 CPU unit tests |
The knlp defconfig system provides one-command reproduction of the core paper findings:
git clone https://github.com/mcgrof/knlp.git && cd knlp
make defconfig-decode # Core asym claims (1×H100, ~4-8 h warm)
make
Three reproduction profiles are planned: defconfig-decode (core full-stack quality
battery + standalone FlashInfer gates + LMCache codec checks),
defconfig-decode-sat (saturation model + Hill fit), and
defconfig-decode-full (everything possible from the paper, with structured skip reports
when cross-GPU hardware is missing). Each defconfig pins exact git refs, clones the three forks into
the parent directory, builds the modified serving stack, runs the configured stages, and writes
machine-readable artifacts to results/decode/<run_id>/. Local JSONL telemetry is
canonical; W&B and trackerio are optional mirrors. See
docs/reproduce/paper-memory-decode.md
for the full quickstart, runtime estimates, hardware requirements, and AI-agent instructions.