Ridge Point Implications for KV Cache Streaming Attention
A GPU has two fundamental limits: how fast it can move bytes, and how fast it can compute. Where those limits intersect is the ridge point. For KV cache streaming attention during decode, the ratio of compute to memory traffic — arithmetic intensity, measured in FLOPs per byte — stays near 1 F/B regardless of context length or model size. The workload is strongly bandwidth-limited, and understanding why requires looking at what the hardware is actually doing. These are theoretical spec-sheet rooflines; sustained measured performance is lower and kernel-dependent.
01 / FOUNDATION
Every kernel does two things: move bytes, then crunch numbers
Before any FLOPs can happen, data has to come from somewhere — HBM memory, L2 cache, system RAM. The GPU loads bytes, does math on them, and writes results back. That's it.
[Diagram — data path: 🗄️ HBM memory (where data lives · bandwidth, GB/s) → ⚡ registers / L1 (on-chip, fast) → 🧮 tensor cores (do the math · FLOPs, TFLOPS) → write back → 🗄️ HBM memory (results stored)]
KEY RATIO
For any kernel, you can measure how much math it does per byte it moves. This ratio is called Arithmetic Intensity:
AI = FLOPs ÷ bytes moved
units: FLOP/byte
To make the math tractable, we use a single-head asymptotic model: count only the K and V reads
and the two attention matmuls for one head, and assume context length is long enough that fixed overheads
like the query load and output write are negligible. Under that model:
a matrix multiply on large matrices does ~512 FLOPs per byte — lots of reuse.
Streaming decode attention over the KV cache does ~1 FLOP per byte — almost none.
A 512× gap in arithmetic intensity between a large-batch GEMM and a decode attention pass
is the whole story of why transformer inference is hard to make fast.
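The streaming-decode figure can be checked numerically. A minimal sketch of the single-head byte and FLOP accounting; the specific d and L values are illustrative, not tied to any particular model:

```python
# Arithmetic intensity = FLOPs ÷ bytes moved, for one decode step of one head.
d, L, s = 128, 4096, 2          # head dim, context length, bytes/elem (FP16)

attn_flops = 4 * d * L          # q @ Kᵀ and scores @ V: 2·d·L FLOPs each
attn_bytes = 2 * d * L * s      # K and V rows streamed from HBM

print(attn_flops / attn_bytes)  # → 1.0 F/B; d and L cancel, only s remains
```

Changing d or L leaves the ratio untouched, which is exactly why longer context does not raise the arithmetic intensity.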
02 / THE TWO CEILINGS
Your GPU has two speed limits. You always hit the lower one.
No matter how many FLOPs the spec sheet claims, actual performance is bounded by whichever resource runs out first — bandwidth or compute.
BANDWIDTH CEILING
Memory-Bound
The GPU can't compute faster than bytes arrive. If it's starved for data, the tensor cores sit idle waiting.
Performance = BW × AI. More bandwidth or higher AI → goes faster.
perf = BW [GB/s] × AI [F/B] ÷ 1000
H100 bandwidth · 3350 GB/s
COMPUTE CEILING
Compute-Bound
The GPU can't compute faster than its tensor cores allow. At some point data arrives faster than it can be processed.
Performance = peak TFLOPS. Fixed ceiling: more AI doesn't help.
perf = peak [TFLOPS] (flat ceiling)
H100 dense BF16/FP16 · ~989 TFLOPS (inferred as ½ sparse)
ACTUAL PERFORMANCE = min(bandwidth ceiling, compute ceiling)
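The min-of-two-ceilings rule is easy to express directly. A minimal sketch using the H100 spec-sheet peaks quoted above (theoretical, not sustained figures):

```python
def roofline_tflops(ai, bw_gbs=3350.0, peak_tflops=989.0):
    """Attainable throughput = min(bandwidth ceiling, compute ceiling).

    Bandwidth ceiling: BW [GB/s] × AI [F/B] ÷ 1000, in TFLOPS.
    Defaults are H100 spec-sheet peaks (dense BF16/FP16 tensor cores).
    """
    return min(bw_gbs * ai / 1000.0, peak_tflops)

print(roofline_tflops(1.0))    # streaming decode, FP16 KV → 3.35 TFLOPS
print(roofline_tflops(410.0))  # large-batch GEMM → 989.0 TFLOPS (compute ceiling)
```

At AI = 1 F/B the bandwidth ceiling allows only 3.35 TFLOPS of the 989 available: the tensor cores are over 99% idle.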
03 / THE RIDGE POINT
The ridge point: where the two ceilings meet
There is exactly one arithmetic intensity where BW × AI = peak TFLOPS.
That crossover is the ridge point. Below it you're memory-bound. Above it you're compute-bound.
And since AI is a property of the algorithm — not a free dial — most kernels sit in a fixed region of this chart.
Pick an illustrative kernel profile — AI values are rough didactic ranges, not measured data
AI is not a dial you turn. It's a property of what the kernel computes.
Streaming decode attention over the KV cache lands near ~1 F/B (FP16) in a simplified per-head model — you can't wish it higher without changing the algorithm.
Click any profile below to see where it lands on the H100 dense BF16/FP16 theoretical roofline.
[Interactive readout — kernel AI · H100 ridge: 295 F/B (dense BF16/FP16 tensor-core reference) · actual perf]
SO HOW DO YOU ACTUALLY INCREASE AI?
You can't change arithmetic intensity by flipping a switch. It requires redesigning the kernel.
There are three real levers:
① OPERATOR FUSION
Fuse multiple kernels into one pass so intermediate results stay in registers/L1 instead of writing back to HBM.
FlashAttention: tiles the QKᵀ matmul, softmax, and AV matmul into a single kernel pass, keeping intermediate scores in on-chip SRAM. The quadratic S/P matrices are never written to HBM — substantially reducing HBM traffic for long sequences.
② INCREASE BATCH SIZE
For GEMMs, larger batches reuse the same weight bytes across more tokens.
For a 4096×4096 FP16 GEMM: B=1 → ~1 F/B, B=16 → ~16 F/B, B=64 → ~62 F/B, B=512 → ~410 F/B.
Scaling is sublinear — input and output bytes also grow with B. Ridge crossed around B≈345 for this shape.
This is the only free lunch: if you can batch more requests together, GEMM AI climbs toward the ridge. But KV cache AI stays fixed regardless of batch size.
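The batch sweep quoted above follows from the byte accounting for a B×K @ K×N GEMM. A sketch with K = N = 4096 and FP16 operands:

```python
def gemm_ai(batch, k=4096, n=4096, bytes_per_elem=2):
    """AI of a batch×K @ K×N GEMM: 2·B·K·N FLOPs over s·(K·N + B·K + B·N) bytes."""
    flops = 2 * batch * k * n
    bytes_moved = bytes_per_elem * (k * n + batch * k + batch * n)
    return flops / bytes_moved

for b in (1, 16, 64, 512):
    print(b, round(gemm_ai(b), 1))    # → 1.0, 15.9, 62.1, 409.6 F/B
```

The weight term K·N is fixed while the activation terms grow with B, which is why the curve is sublinear and eventually saturates.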
③ COMPRESS THE BYTES
Same FLOPs, fewer bytes moved → AI goes up. This is what KV quantization and MLA do.
4× KV compression → 4× higher AI in the single-head streaming model. Whole-layer AI with GQA also improves.
Doesn't push past the ridge, but meaningfully reduces total HBM traffic.
THE HARD TRUTH
Even with all three techniques combined, most transformer inference kernels still never reach the ridge point (~200–295 F/B on modern GPUs). Common practical compression and head-sharing schemes can move whole-layer effective AI into the low tens of F/B; more aggressive MLA-style compression can go higher in principle but real kernels land lower once overheads are counted. The ridge is a theoretical ceiling, not a realistic target for these workloads.
FORMULA
ridge [F/B] = peak TFLOPS ÷ bandwidth [TB/s]
dense BF16/FP16 theoretical · spec-sheet peaks · sustained measured rooflines are lower
H100: 989 ÷ 3.35 = 295 F/B
H200: 989 ÷ 4.80 = 206 F/B
B200 (HGX): 2250 ÷ 8.0 = 281 F/B
MI300X: 1307 ÷ 5.30 = 247 F/B
⚠ A GPU does not have one universal ridge point. The ridge depends on which compute ceiling you use.
H100 tensor-core FP16 gives ~295 F/B. H100 FP32 vector throughput gives ~20 F/B. Non-tensor-core kernels
(layernorm, softmax, gather) cannot use the full tensor-core ceiling, so their effective ridge is much lower.
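The table above is just the formula applied per GPU. A sketch using the spec-sheet numbers quoted in this section:

```python
# Ridge point [F/B] = peak TFLOPS ÷ bandwidth [TB/s].
# Dense BF16/FP16 tensor-core peaks; sustained rooflines sit lower.
specs = {
    "H100":       (989.0, 3.35),
    "H200":       (989.0, 4.80),
    "B200 (HGX)": (2250.0, 8.0),
    "MI300X":     (1307.0, 5.30),
}
for gpu, (peak_tflops, bw_tbs) in specs.items():
    print(f"{gpu}: ridge ≈ {peak_tflops / bw_tbs:.0f} F/B")
```

Swapping in a different compute ceiling (e.g. FP32 vector throughput) gives a different, much lower ridge on the same chip, as the warning above notes.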
04 / TRANSFORMER WORKLOADS
Where do actual transformer kernels land on this chart?
Most transformer operations have very low arithmetic intensity — they spend most time waiting on bandwidth, not computing.
Values below are illustrative rough ranges; actual measurements vary by implementation, hardware, and batch size.
The ridge shown is H100 dense BF16/FP16 theoretical (~295 F/B). Non-tensor-core kernels have lower effective ridges.
KERNEL
ARITH INTENSITY
REGIME
WHY
Streaming decode attention (KV)
~1 F/B (FP16)
MEMORY
Per-head simplified model: 4·d·L FLOPs over 2·d·L·B bytes → 2/B. See §05.
Embedding lookup
~0 F/B
MEMORY
Gather: one row fetched per token. Essentially pure memory traffic, near-zero compute.
Attention (short seq)
~4–16 F/B
MEMORY
Rough range; scales with seq len and Q/KV head ratio. Still BW-bound at inference.
GEMM (4K×4K, large batch)
up to ~410 F/B (B=512)
COMPUTE
Ridge (~295 F/B) crossed around B≈345 for 4K×4K FP16. Tensor-core bound.
TAKEAWAY
With a theoretical ridge of ~295 F/B (H100 dense BF16/FP16), the tensor cores are idle for almost every transformer inference kernel. You're paying for 989 TFLOPS but only able to use them during large batched matrix multiplies. The rest of the time you're bottlenecked on the 3.35 TB/s pipe.
This is why FlashAttention matters — it eliminates the quadratic S/P intermediates from HBM, reducing total bytes moved without changing compute. And it's why KV quantization and MLA help: fewer bytes per element, same FLOPs, better effective AI.
05 / STREAMING DECODE ATTENTION — THE DEEP DIVE
Why streaming decode attention over the KV cache has ~1 F/B arithmetic intensity
During every decode step, the GPU streams the KV cache out of HBM while computing attention against it.
In a simplified per-query-head asymptotic model — which ignores fixed overheads like query loads, output writes,
metadata, and dequantization scales — the arithmetic intensity simplifies to a function of
bytes per element alone. Context length cancels. Head dimension cancels. Batch size cancels.
Whole-layer effective AI can also rise with head sharing (GQA/MQA) or KV compression (MLA) — but the single-head
baseline illuminates exactly what drives the bottleneck.
THE MATH — SIMPLIFIED ASYMPTOTIC MODEL (ONE QUERY HEAD)
For a single decode step, consider one attention head in one layer. The query vector q is a single row
of dimension d. The KV cache holds L previous tokens, each with a K and V vector of dimension d.
This model counts only the dominant K/V read traffic and the two matmuls — ignoring query loads, output writes,
scale factors, and other bookkeeping that become significant at short context lengths.
BYTES LOADED FROM HBM
K matrix: L × d × sizeof(dtype)
V matrix: L × d × sizeof(dtype)
──────────────────────
total = 2 · L · d · B
where B = bytes per element (2 for FP16, 1 for INT8)
FLOPS COMPUTED
scores: q @ Kᵀ → 2·d·L FLOPs
output: scores @ V → 2·d·L FLOPs
──────────────────────
total = 4 · d · L
softmax negligible; dominated by the two matmuls
ARITHMETIC INTENSITY — SINGLE HEAD, LARGE-CONTEXT ASYMPTOTE
AI = 4·d·L / (2·d·L·B) = 2 / B
per-head asymptote: L cancels · d cancels · only bytes-per-element B remains
WHOLE-LAYER EFFECTIVE AI ALSO SCALES WITH Q/KV HEAD RATIO
The single-head model above has AI = 2/B. But at the layer level, if Hq query heads
all attend over the same Hkv KV heads (GQA/MQA), the same K/V bytes serve multiple query heads.
Whole-layer AI is approximately:
AI_layer ≈ 2 · (Hq / Hkv) / B
MHA (Hq=Hkv=32): ~1 F/B (FP16) ·
GQA (32Q / 8KV): ~4 F/B ·
MQA (32Q / 1KV): ~32 F/B.
Still firmly memory-bound on any current GPU, but bytes per element alone is not the full story at the layer level.
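The whole-layer formula can be tabulated directly. A sketch reproducing the three head-sharing configurations above, plus one combined quantization case:

```python
def layer_ai(hq, hkv, bytes_per_elem=2):
    """Whole-layer effective AI ≈ 2·(Hq/Hkv)/B: the same K/V bytes
    serve Hq/Hkv query heads each, so AI scales with the head ratio."""
    return 2 * (hq / hkv) / bytes_per_elem

print(layer_ai(32, 32))     # MHA, FP16 → 1.0 F/B
print(layer_ai(32, 8))      # GQA 32Q/8KV, FP16 → 4.0 F/B
print(layer_ai(32, 1))      # MQA 32Q/1KV, FP16 → 32.0 F/B
print(layer_ai(32, 8, 1))   # GQA + INT8 KV → 8.0 F/B
```

Head sharing and byte compression multiply, which is why GQA plus quantized KV is such a common pairing.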
FP32: 0.5 F/B · B = 4 bytes
FP16 / BF16: 1.0 F/B · B = 2 bytes · standard
INT8: 2.0 F/B · B = 1 byte · 2× better
INT4: 4.0 F/B · B = 0.5 bytes · 4× better
KEY RESULT
Even INT4 gives only 4 F/B per head — still 70–80× below the H100's dense BF16/FP16 theoretical ridge of ~295 F/B.
MLA and extreme GQA can push whole-layer effective AI higher, but decode attention remains
strongly bandwidth-limited in practice on current hardware.
Streaming Decode Attention Calculator — per decode step
KV format controls bytes-per-element in the stored KV cache, not necessarily the active compute precision of the attention kernel — quantized KV is often dequantized to a higher precision before the matmul. The H100 reference roofline (295 F/B) is always the dense BF16/FP16 tensor-core theoretical ceiling regardless of KV format selected.
[Interactive calculator — model parameters: context 4K · 32 Q heads · 8 KV heads · head dim 128; serving parameters: batch 1 · 0% shared prefix. Readouts: AI (single head) 1.0 F/B (asymptotic model · dtype only) · idealized batch read (shared-once model · K+V bytes) · min time @ peak BW (lower bound at 3,350 GB/s) · aggregate tokens/sec (batch total · if KV load were the only cost)]
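The calculator's readouts can be approximated with a short sketch. The 32-layer depth is an assumed value for a hypothetical Llama-7B-scale model (not a figure from this section), and the time is a bandwidth-only lower bound, not a measured latency:

```python
def decode_step(ctx_len, n_layers=32, n_kv_heads=8, head_dim=128,
                batch=1, bytes_per_elem=2, bw_gbs=3350.0):
    """KV bytes read per decode step, plus a bandwidth-only lower bound on time.

    n_layers=32 is an assumed Llama-7B-scale depth; the model assumes each
    request streams its own full K+V cache and that this is the only cost.
    """
    kv_bytes = (batch * n_layers * n_kv_heads * 2   # K and V
                * ctx_len * head_dim * bytes_per_elem)
    step_s = kv_bytes / (bw_gbs * 1e9)              # lower bound at peak BW
    return kv_bytes, step_s, batch / step_s         # bytes, seconds, tok/s

kv, t, tps = decode_step(4096)
print(f"{kv / 2**20:.0f} MiB/step · {t * 1e3:.2f} ms · {tps:.0f} tok/s")
```

Under these assumptions a single 4K-context request reads 512 MiB of KV per token, capping single-stream decode at a few thousand tokens/sec even at full HBM bandwidth.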
BATCHING — WHY IT HELPS GEMM BUT NOT KV CACHE
GEMM — AI SCALES WITH BATCH (sublinearly)
Weight W is K×N, input batch is B×K, output is B×N.
More tokens share the same weight bytes — but the input and output bytes also grow with B.
FLOPs = 2·B·K·N
bytes = s·(K·N + B·K + B·N)
AI = B·K·N / (K·N + B·(K+N)) → sublinear in B; saturates near the ridge
KV CACHE — AI CONSTANT IN BATCH
Each user has their own KV cache in separate HBM regions. Batching N users loads
N separate blobs. FLOPs and bytes both scale with N — the ratio stays fixed.
bytes = N · 2·L·d·B
FLOPs = N · 4·d·L
AI = 2/B (constant) → N cancels out entirely
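The N-cancellation can be checked directly. A sketch under the same per-head asymptotic model:

```python
def batched_kv_ai(n_users, ctx_len=4096, head_dim=128, bytes_per_elem=2):
    """Batched decode: each user streams a private KV blob, so N cancels."""
    flops = n_users * 4 * head_dim * ctx_len
    bytes_moved = n_users * 2 * head_dim * ctx_len * bytes_per_elem
    return flops / bytes_moved

print([batched_kv_ai(n) for n in (1, 16, 512)])   # → [1.0, 1.0, 1.0]
```

Every term scales linearly in N, so no amount of request batching moves decode attention toward the ridge.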
AI VS BATCH SIZE — GEMM 4K×4K FP16 vs FP16 KV (H100 ridge=295 F/B)
PREFIX CACHING — THREE QUANTITIES TO KEEP SEPARATE
When multiple requests share an identical prefix — same system prompt, same document context — that prefix KV
can be computed once and the blocks reused.
This affects three different quantities in different ways, and confusing them leads to bad offloading analysis later:
RESIDENT KV FOOTPRINT
Shared prefix blocks are stored once. Reduces how much HBM capacity the KV cache occupies.
This is supported by prefix-caching runtimes built on block KV management.
IDEALIZED DECODE READ TRAFFIC
In an idealized simultaneous-batch kernel, shared prefix bytes might be read once across all users.
This is a valid model to explore — but it depends on batching and runtime implementation details.
PER-REQUEST AI — UNCHANGED
For a single request, the per-head asymptotic baseline AI formula (2/B) is unchanged.
Prefix sharing does not alter the single-request baseline AI — though in an idealized batched kernel
with true shared-read semantics, whole-batch effective FLOPs-per-byte for the shared portion can rise.
TWO RELATED BUT DISTINCT MECHANISMS
Block/page KV management (PagedAttention in vLLM): partitions the KV cache into fixed-size blocks
stored in non-contiguous physical memory, reducing fragmentation and enabling flexible allocation.
This is primarily a memory management technique.
Automatic prefix caching (vLLM's APC, RadixAttention in SGLang): uses hash-based reuse
to skip recomputing KV blocks when a new request shares a prefix with a prior one.
This saves prefill compute and resident KV footprint for the shared prefix.
Whether it also reduces HBM read traffic per decode step depends on the batching strategy and kernel implementation.
A 2,000-token system prompt shared across 10,000 users would otherwise require 10,000 copies of the same KV data in HBM.
With prefix caching, only one copy is stored — a massive footprint win.
Prefix caching can also help a single later request that arrives alone — it reuses already-computed prompt KV
blocks without requiring a simultaneous batch.
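The 2,000-token / 10,000-user example can be made concrete. The layer count, KV-head count, and head dimension below are assumed values for a hypothetical Llama-like model, not figures from the text:

```python
def prefix_kv_bytes(tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                    bytes_per_elem=2):
    """KV-cache bytes for a prompt prefix: K and V per token, layer, KV head."""
    return tokens * n_layers * n_kv_heads * 2 * head_dim * bytes_per_elem

one = prefix_kv_bytes(2000)                # one stored copy of the prefix
print(f"one copy: {one / 2**20:.0f} MiB")
print(f"10,000 unshared copies: {10_000 * one / 2**40:.2f} TiB")
```

Under these assumptions the shared copy fits comfortably in HBM while the unshared version would need terabytes, which is the footprint win the text describes.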
↗ THE OFFLOADING EXTENSION — THREE QUANTITIES
Understanding KV cache at the roofline level maps directly onto offloading analysis.
There are three distinct quantities — confusing them produces bad trade-off reasoning:
RESIDENT KV FOOTPRINT
How much KV data must exist somewhere — HBM, DRAM, SSD, or a remote tier.
Offloading reduces HBM pressure by moving cold KV blocks elsewhere.
HBM DECODE TRAFFIC
How many K/V bytes must be read per decode step from HBM.
This is what the AI formula models. Compression and head-sharing reduce it directly.
OFFLOADED TRANSFER TRAFFIC
How many bytes must cross PCIe, CXL, NVLink, or storage links on the decode critical path.
Offloading solves footprint but can move the bottleneck here instead.
Recent KV-offloading work identifies PCIe transfer latency as the dominant cost in many offloaded workloads.
THE REAL LEVERS — WHAT ACTUALLY HELPS
① KV QUANTIZATION
INT8 halves bytes → doubles AI to 2 F/B.
INT4 quarters bytes → 4 F/B.
Still far from the ridge but real speedup — you can saturate HBM with fewer tokens, or serve more users with the same bandwidth.
② MLA (MULTI-HEAD LATENT ATTN)
DeepSeek's approach: compress K and V into a lower-dimensional latent vector before caching.
DeepSeek-V2 reports a 93.3% KV-cache reduction relative to standard MHA.
Fewer bytes stored and loaded → effective AI improves proportionally.
③ GQA / MQA
Grouped-query and multi-query attention reduce the number of KV heads (e.g. 32Q heads sharing 8 KV heads).
Fewer KV heads means fewer bytes in the cache.
Widely used in modern families: Llama 3 and Mistral use GQA; later Gemma-family models adopt it as well (original Gemma 7B used MHA; Gemma 2B used MQA).
④ PREFIX CACHING
Doesn't improve AI per kernel — but eliminates redundant HBM storage and recomputation across requests.
At scale this is often the largest practical win.
Automatic Prefix Caching in vLLM and RadixAttention in SGLang implement prefix reuse;
PagedAttention provides the block-based KV management substrate that makes this practical.
HARD TRUTH
Common practical schemes — INT8 KV, modest GQA (32Q/8KV) — can move whole-layer effective AI
into the low tens of F/B. More aggressive MLA-style compression can go higher in principle.
Real kernels land lower once fixed overheads, non-dominant bytes, and dequantization cost are included.
Either way, still well below the H100 dense BF16/FP16 theoretical ridge of ~295 F/B.
Decode attention is strongly bandwidth-limited in practice on current hardware.
The ridge is not a realistic target — it's a reminder of how far away compute-bound territory is,
and why memory bandwidth is the dominant cost lever.
06 / FLASHATTENTION
FlashAttention: same math, far fewer bytes through HBM
Naive attention materializes the full n×n score matrix S and softmax result P in HBM — quadratic in sequence length.
FlashAttention avoids this entirely by tiling the computation so that S and P are computed block-by-block in on-chip SRAM,
consumed immediately, and never written to HBM. The inputs (Q, K, V) and output (O) still stream through HBM in both approaches.
FlashAttention computes the same exact attention result and has the same asymptotic O(n²d) workload,
but practical FLOP counts can differ modestly between implementations — the paper itself shows cases
where FlashAttention has a slightly higher FLOP count yet is dramatically faster because HBM accesses drop so much.
The win is fewer HBM accesses and no full S/P materialization.
BEFORE
Naive Attention
⬛ HBM — full n×n attention scores live here
all n² scores stored in HBM · grows quadratically
◻ SRAM — used within each kernel, but S/P intermediates spill to HBM between steps
full n×n attention matrices written to HBM before the next kernel reads them
# three kernels, each writing its result to HBM
S = Q @ Kᵀ       → HBM
P = softmax(S)   → HBM
O = P @ V        → HBM
HBM: O(n²) — quadratic S/P intermediates · S, P materialized between kernels
→
AFTER
FlashAttention
✓ HBM — Q, K, V (inputs) + O (output)
Only Q/K/V/O reside off-chip — S/P intermediates are not materialized in HBM
⚡ SRAM — tiled attention (constant size)
■ in SRAM now · × done, discarded · □ not yet
# one pass, tile at a time
for q_tile in tile(Q):
    for kv_tile in tile(K, V):
        accumulate(O)   # stays in SRAM
HBM: O(n·d) — no S/P materialization · reduced HBM accesses vs naive
KEY INSIGHT
The n×n attention matrix is ephemeral — computed tile-by-tile in fast on-chip SRAM, consumed immediately, then thrown away. It never gets written to HBM.
FlashAttention computes the exact same attention result. The win is far fewer HBM accesses — the quadratic intermediates simply never touch off-chip memory.
TABLE A — TEMPORARY SCORE MATERIALIZATION IN HBM (per head, FP16)
CONTEXT · NAIVE S+P · FLASH S+P
1K · 4 MiB · not materialized
4K · 64 MiB · not materialized
16K · 1 GiB · not materialized
128K · 64 GiB · not materialized
Naive: S=QKᵀ and P=softmax(S) written to HBM · size = 4·n² bytes/head at FP16
Flash: S/P computed tile-by-tile in SRAM, never written to HBM
Both naive and Flash must stream Q, K, V, O — this traffic is identical.
Formula: 8·n·d bytes/head (Q+K+V+O) · d=128 · FP16
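Table A's entries follow from the two byte formulas in the notes above. A quick sketch reproducing them per head at FP16:

```python
def naive_sp_bytes(n, bytes_per_elem=2):
    """Naive attention: S = QKᵀ and P = softmax(S), each n×n, written to HBM."""
    return 2 * n * n * bytes_per_elem            # 4·n² bytes/head at FP16

def qkvo_bytes(n, d=128, bytes_per_elem=2):
    """Q, K, V, O streaming traffic: identical for naive and FlashAttention."""
    return 4 * n * d * bytes_per_elem            # 8·n·d bytes/head at FP16

for n in (1024, 4096, 16384, 131072):
    print(f"n={n:>6}: naive S+P = {naive_sp_bytes(n) / 2**20:,.0f} MiB, "
          f"QKVO = {qkvo_bytes(n) / 2**20:.1f} MiB")
```

The S/P term grows quadratically while QKVO grows linearly, so the fraction of traffic FlashAttention eliminates approaches 100% at long context.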
CONNECTION TO RIDGE POINT
FlashAttention's win is eliminating the quadratic S and P intermediates from HBM.
At 4K context with d=128, naive attention writes 64 MiB of S/P per head to HBM — bytes that don't exist at all in FlashAttention.
That is a substantial and real reduction in HBM traffic.
But both algorithms still read the same Q, K, V from HBM and write the same O back — about 4 MiB per head at 4K context.
FlashAttention doesn't change how many bytes the inputs and outputs require; it eliminates the wasteful intermediates.
That moves the workload's effective position on the roofline chart — more useful compute per byte actually moved.
The on-chip working SRAM is constant with respect to context length for a fixed tile configuration, but its exact
size depends on tile dimensions, head dimension, and kernel implementation — it is not a single universal constant.
07 / FULL ROOFLINE CHART
Read the roofline chart
Each line is one GPU's theoretical spec-sheet roofline using dense BF16/FP16 peaks — sustained measured throughput lies below these lines and is kernel-dependent.
The dot marks its ridge point. Everything left of that dot is memory-bound on that GPU. Everything right is compute-bound.
The arithmetic intensity [FLOP/byte] where your GPU transitions from bandwidth-bound to compute-bound. Computed as peak TFLOPS ÷ bandwidth (in TB/s) for the chosen compute ceiling — different precisions and kernel paths give different ridges on the same chip.
left of ridge
Memory-bound. Tensor cores are idle. Adding more FLOPs doesn't help — you need more bandwidth, or you need to move fewer bytes (compression, fusion).
right of ridge
Compute-bound. Bandwidth is underutilized. Adding bandwidth doesn't help — the FLOPs are the bottleneck.
at the ridge
Perfect balance. Both bandwidth and compute are fully utilized simultaneously. Almost never achieved in practice.
transformers
Mostly bandwidth-limited. Many common inference kernels — embedding gather (~0 F/B), streaming decode attention (~1–32 F/B depending on head sharing and dtype), GEMMs at batch=1 — sit far left of dense BF16/FP16 rooflines (~200–295 F/B). Large batched GEMMs are the main exception.