Ridge Point Implications for KV Cache Streaming Attention
A GPU has two fundamental limits: how fast it can move bytes, and how fast it can compute. Where those limits intersect is the ridge point. For KV cache streaming attention during decode, the ratio of compute to memory traffic — arithmetic intensity, measured in FLOPs per byte — stays near 1 F/B regardless of context length or model size. The workload is strongly bandwidth-limited, and understanding why requires looking at what the hardware is actually doing. These are theoretical spec-sheet rooflines; sustained measured performance is lower and kernel-dependent.
01 / FOUNDATION
Every kernel does two things: move bytes, then crunch numbers
Before any FLOPs can happen, data has to come from somewhere — HBM memory, L2 cache, system RAM. The GPU loads bytes, does math on them, and writes results back. That's it.
[Diagram — data path: 🗄️ HBM memory (where data lives · bandwidth, GB/s) → ⚡ registers / L1 (on-chip, fast) → 🧮 tensor cores (do the math · FLOPs, TFLOPS) → write back → 🗄️ HBM memory (results stored)]
KEY RATIO
For any kernel, you can measure how much math it does per byte it moves. This ratio is called Arithmetic Intensity:
AI = FLOPs ÷ bytes moved
units: FLOP/byte
To make the math tractable, we use a single-head asymptotic model: count only the K and V reads
and the two attention matmuls for one head, and assume context length is long enough that fixed overheads
like the query load and output write are negligible. Under that model:
a matrix multiply on large matrices does ~512 FLOPs per byte — lots of reuse.
Streaming decode attention over the KV cache does ~1 FLOP per byte — almost none.
A 512× gap in arithmetic intensity between a large-batch GEMM and a decode attention pass
is the whole story of why transformer inference is hard to make fast.
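The streaming-decode figure can be checked numerically. A minimal sketch of the single-head byte and FLOP accounting; the specific d and L values are illustrative, not tied to any particular model:

```python
# Arithmetic intensity = FLOPs ÷ bytes moved, for one decode step of one head.
d, L, s = 128, 4096, 2          # head dim, context length, bytes/elem (FP16)

attn_flops = 4 * d * L          # q @ Kᵀ and scores @ V: 2·d·L FLOPs each
attn_bytes = 2 * d * L * s      # K and V rows streamed from HBM

print(attn_flops / attn_bytes)  # → 1.0 F/B; d and L cancel, only s remains
```

Changing d or L leaves the ratio untouched, which is exactly why longer context does not raise the arithmetic intensity.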
02 / THE TWO CEILINGS
Your GPU has two speed limits. You always hit the lower one.
No matter how many FLOPs the spec sheet claims, actual performance is bounded by whichever resource runs out first — bandwidth or compute.
BANDWIDTH CEILING
Memory-Bound
The GPU can't compute faster than bytes arrive. If it's starved for data, the tensor cores sit idle waiting.
Performance = BW × AI. More bandwidth or higher AI → goes faster.
perf = BW [GB/s] × AI [F/B] ÷ 1000
H100 bandwidth · 3350 GB/s
COMPUTE CEILING
Compute-Bound
The GPU can't compute faster than its tensor cores allow. At some point data arrives faster than it can be processed.
Performance = peak TFLOPS. Fixed ceiling: more AI doesn't help.
perf = peak [TFLOPS] (flat ceiling)
H100 dense BF16/FP16 · ~989 TFLOPS (inferred as ½ sparse)
ACTUAL PERFORMANCE = min(bandwidth ceiling, compute ceiling)
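The min-of-two-ceilings rule is easy to express directly. A minimal sketch using the H100 spec-sheet peaks quoted above (theoretical, not sustained figures):

```python
def roofline_tflops(ai, bw_gbs=3350.0, peak_tflops=989.0):
    """Attainable throughput = min(bandwidth ceiling, compute ceiling).

    Bandwidth ceiling: BW [GB/s] × AI [F/B] ÷ 1000, in TFLOPS.
    Defaults are H100 spec-sheet peaks (dense BF16/FP16 tensor cores).
    """
    return min(bw_gbs * ai / 1000.0, peak_tflops)

print(roofline_tflops(1.0))    # streaming decode, FP16 KV → 3.35 TFLOPS
print(roofline_tflops(410.0))  # large-batch GEMM → 989.0 TFLOPS (compute ceiling)
```

At AI = 1 F/B the bandwidth ceiling allows only 3.35 TFLOPS of the 989 available: the tensor cores are over 99% idle.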
03 / THE RIDGE POINT
The ridge point: where the two ceilings meet
There is exactly one arithmetic intensity where BW × AI = peak TFLOPS.
That crossover is the ridge point. Below it you're memory-bound. Above it you're compute-bound.
And since AI is a property of the algorithm — not a free dial — most kernels sit in a fixed region of this chart.
Pick an illustrative kernel profile — AI values are rough didactic ranges, not measured data
AI is not a dial you turn. It's a property of what the kernel computes.
Streaming decode attention over the KV cache lands near ~1 F/B (FP16) in a simplified per-head model — you can't wish it higher without changing the algorithm.
Click any profile below to see where it lands on the H100 dense BF16/FP16 theoretical roofline.
[Interactive readout — kernel AI · H100 ridge: 295 F/B (dense BF16/FP16 tensor-core reference) · actual perf]
SO HOW DO YOU ACTUALLY INCREASE AI?
You can't change arithmetic intensity by flipping a switch. It requires redesigning the kernel.
There are three real levers:
① OPERATOR FUSION
Fuse multiple kernels into one pass so intermediate results stay in registers/L1 instead of writing back to HBM.
FlashAttention: tiles the QKᵀ matmul, softmax, and AV matmul into a single kernel pass, keeping intermediate scores in on-chip SRAM. The quadratic S/P matrices are never written to HBM — substantially reducing HBM traffic for long sequences.
② INCREASE BATCH SIZE
For GEMMs, larger batches reuse the same weight bytes across more tokens.
For a 4096×4096 FP16 GEMM: B=1 → ~1 F/B, B=16 → ~16 F/B, B=64 → ~62 F/B, B=512 → ~410 F/B.
Scaling is sublinear — input and output bytes also grow with B. Ridge crossed around B≈345 for this shape.
This is the only free lunch: if you can batch more requests together, GEMM AI climbs toward the ridge. But KV cache AI stays fixed regardless of batch size.
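The batch sweep quoted above follows from the byte accounting for a B×K @ K×N GEMM. A sketch with K = N = 4096 and FP16 operands:

```python
def gemm_ai(batch, k=4096, n=4096, bytes_per_elem=2):
    """AI of a batch×K @ K×N GEMM: 2·B·K·N FLOPs over s·(K·N + B·K + B·N) bytes."""
    flops = 2 * batch * k * n
    bytes_moved = bytes_per_elem * (k * n + batch * k + batch * n)
    return flops / bytes_moved

for b in (1, 16, 64, 512):
    print(b, round(gemm_ai(b), 1))    # → 1.0, 15.9, 62.1, 409.6 F/B
```

The weight term K·N is fixed while the activation terms grow with B, which is why the curve is sublinear and eventually saturates.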
③ COMPRESS THE BYTES
Same FLOPs, fewer bytes moved → AI goes up. This is what KV quantization and MLA do.
4× KV compression → 4× higher AI in the single-head streaming model. Whole-layer AI with GQA also improves.
Doesn't push past the ridge, but meaningfully reduces total HBM traffic.
THE HARD TRUTH
Even with all three techniques combined, most transformer inference kernels still never reach the ridge point (~200–295 F/B on modern GPUs). Common practical compression and head-sharing schemes can move whole-layer effective AI into the low tens of F/B; more aggressive MLA-style compression can go higher in principle but real kernels land lower once overheads are counted. The ridge is a theoretical ceiling, not a realistic target for these workloads.
FORMULA
ridge [F/B] = peak TFLOPS ÷ bandwidth [TB/s]
dense BF16/FP16 theoretical · spec-sheet peaks · sustained measured rooflines are lower
H100: 989 ÷ 3.35 = 295 F/B
H200: 989 ÷ 4.80 = 206 F/B
B200 (HGX): 2250 ÷ 8.0 = 281 F/B
MI300X: 1307 ÷ 5.30 = 247 F/B
⚠ A GPU does not have one universal ridge point. The ridge depends on which compute ceiling you use.
H100 tensor-core FP16 gives ~295 F/B. H100 FP32 vector throughput gives ~20 F/B. Non-tensor-core kernels
(layernorm, softmax, gather) cannot use the full tensor-core ceiling, so their effective ridge is much lower.
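The table above is just the formula applied per GPU. A sketch using the spec-sheet numbers quoted in this section:

```python
# Ridge point [F/B] = peak TFLOPS ÷ bandwidth [TB/s].
# Dense BF16/FP16 tensor-core peaks; sustained rooflines sit lower.
specs = {
    "H100":       (989.0, 3.35),
    "H200":       (989.0, 4.80),
    "B200 (HGX)": (2250.0, 8.0),
    "MI300X":     (1307.0, 5.30),
}
for gpu, (peak_tflops, bw_tbs) in specs.items():
    print(f"{gpu}: ridge ≈ {peak_tflops / bw_tbs:.0f} F/B")
```

Swapping in a different compute ceiling (e.g. FP32 vector throughput) gives a different, much lower ridge on the same chip, as the warning above notes.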
04 / TRANSFORMER WORKLOADS
Where do actual transformer kernels land on this chart?
Most transformer operations have very low arithmetic intensity — they spend most time waiting on bandwidth, not computing.
Values below are illustrative rough ranges; actual measurements vary by implementation, hardware, and batch size.
The ridge shown is H100 dense BF16/FP16 theoretical (~295 F/B). Non-tensor-core kernels have lower effective ridges.
KERNEL
ARITH INTENSITY
REGIME
WHY
Streaming decode attention (KV)
~1 F/B (FP16)
MEMORY
Per-head simplified model: 4·d·L FLOPs over 2·d·L·B bytes → 2/B. See §05.
Embedding lookup
~0 F/B
MEMORY
Gather: one row fetched per token. Essentially pure memory traffic, near-zero compute.
Attention (short seq)
~4–16 F/B
MEMORY
Rough range; scales with seq len and Q/KV head ratio. Still BW-bound at inference.
GEMM (4K×4K, large batch)
up to ~410 F/B (B=512)
COMPUTE
Ridge (~295 F/B) crossed around B≈345 for 4K×4K FP16. Tensor-core bound.
TAKEAWAY
With a theoretical ridge of ~295 F/B (H100 dense BF16/FP16), the tensor cores are idle for almost every transformer inference kernel. You're paying for 989 TFLOPS but only able to use them during large batched matrix multiplies. The rest of the time you're bottlenecked on the 3.35 TB/s pipe.
This is why FlashAttention matters — it eliminates the quadratic S/P intermediates from HBM, reducing total bytes moved without changing compute. And it's why KV quantization and MLA help: fewer bytes per element, same FLOPs, better effective AI.
05 / STREAMING DECODE ATTENTION — THE DEEP DIVE
Why streaming decode attention over the KV cache has ~1 F/B arithmetic intensity
During every decode step, the GPU streams the KV cache out of HBM while computing attention against it.
In a simplified per-query-head asymptotic model — which ignores fixed overheads like query loads, output writes,
metadata, and dequantization scales — the arithmetic intensity simplifies to a function of
bytes per element alone. Context length cancels. Head dimension cancels. Batch size cancels.
Whole-layer effective AI can also rise with head sharing (GQA/MQA) or KV compression (MLA) — but the single-head
baseline illuminates exactly what drives the bottleneck.
THE MATH — SIMPLIFIED ASYMPTOTIC MODEL (ONE QUERY HEAD)
For a single decode step, consider one attention head in one layer. The query vector q is a single row
of dimension d. The KV cache holds L previous tokens, each with a K and V vector of dimension d.
This model counts only the dominant K/V read traffic and the two matmuls — ignoring query loads, output writes,
scale factors, and other bookkeeping that become significant at short context lengths.
BYTES LOADED FROM HBM
K matrix: L × d × sizeof(dtype)
V matrix: L × d × sizeof(dtype)
──────────────────────
total = 2 · L · d · B
where B = bytes per element (2 for FP16, 1 for INT8)
FLOPS COMPUTED
scores: q @ Kᵀ → 2·d·L FLOPs
output: scores @ V → 2·d·L FLOPs
──────────────────────
total = 4 · d · L
softmax negligible; dominated by the two matmuls
ARITHMETIC INTENSITY — SINGLE HEAD, LARGE-CONTEXT ASYMPTOTE
AI = 4·d·L / (2·d·L·B) = 2 / B
per-head asymptote: L cancels · d cancels · only bytes-per-element B remains
WHOLE-LAYER EFFECTIVE AI ALSO SCALES WITH Q/KV HEAD RATIO
The single-head model above has AI = 2/B. But at the layer level, if Hq query heads
all attend over the same Hkv KV heads (GQA/MQA), the same K/V bytes serve multiple query heads.
Whole-layer AI is approximately:
AI_layer ≈ 2 · (Hq / Hkv) / B
MHA (Hq=Hkv=32): ~1 F/B (FP16) ·
GQA (32Q / 8KV): ~4 F/B ·
MQA (32Q / 1KV): ~32 F/B.
Still firmly memory-bound on any current GPU, but bytes per element alone is not the full story at the layer level.
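The whole-layer formula can be tabulated directly. A sketch reproducing the three head-sharing configurations above, plus one combined quantization case:

```python
def layer_ai(hq, hkv, bytes_per_elem=2):
    """Whole-layer effective AI ≈ 2·(Hq/Hkv)/B: the same K/V bytes
    serve Hq/Hkv query heads each, so AI scales with the head ratio."""
    return 2 * (hq / hkv) / bytes_per_elem

print(layer_ai(32, 32))     # MHA, FP16 → 1.0 F/B
print(layer_ai(32, 8))      # GQA 32Q/8KV, FP16 → 4.0 F/B
print(layer_ai(32, 1))      # MQA 32Q/1KV, FP16 → 32.0 F/B
print(layer_ai(32, 8, 1))   # GQA + INT8 KV → 8.0 F/B
```

Head sharing and byte compression multiply, which is why GQA plus quantized KV is such a common pairing.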
FP32: 0.5 F/B · B = 4 bytes
FP16 / BF16: 1.0 F/B · B = 2 bytes · standard
INT8: 2.0 F/B · B = 1 byte · 2× better
INT4: 4.0 F/B · B = 0.5 bytes · 4× better
KEY RESULT
Even INT4 gives only 4 F/B per head — still 70–80× below the H100's dense BF16/FP16 theoretical ridge of ~295 F/B.
MLA and extreme GQA can push whole-layer effective AI higher, but decode attention remains
strongly bandwidth-limited in practice on current hardware.
Streaming Decode Attention Calculator — per decode step
KV format controls bytes-per-element in the stored KV cache, not necessarily the active compute precision of the attention kernel — quantized KV is often dequantized to a higher precision before the matmul. The H100 reference roofline (295 F/B) is always the dense BF16/FP16 tensor-core theoretical ceiling regardless of KV format selected.
[Interactive calculator — model parameters: context 4K · 32 Q heads · 8 KV heads · head dim 128; serving parameters: batch 1 · 0% shared prefix. Readouts: AI (single head) 1.0 F/B (asymptotic model · dtype only) · idealized batch read (shared-once model · K+V bytes) · min time @ peak BW (lower bound at 3,350 GB/s) · aggregate tokens/sec (batch total · if KV load were the only cost)]
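The calculator's readouts can be approximated with a short sketch. The 32-layer depth is an assumed value for a hypothetical Llama-7B-scale model (not a figure from this section), and the time is a bandwidth-only lower bound, not a measured latency:

```python
def decode_step(ctx_len, n_layers=32, n_kv_heads=8, head_dim=128,
                batch=1, bytes_per_elem=2, bw_gbs=3350.0):
    """KV bytes read per decode step, plus a bandwidth-only lower bound on time.

    n_layers=32 is an assumed Llama-7B-scale depth; the model assumes each
    request streams its own full K+V cache and that this is the only cost.
    """
    kv_bytes = (batch * n_layers * n_kv_heads * 2   # K and V
                * ctx_len * head_dim * bytes_per_elem)
    step_s = kv_bytes / (bw_gbs * 1e9)              # lower bound at peak BW
    return kv_bytes, step_s, batch / step_s         # bytes, seconds, tok/s

kv, t, tps = decode_step(4096)
print(f"{kv / 2**20:.0f} MiB/step · {t * 1e3:.2f} ms · {tps:.0f} tok/s")
```

Under these assumptions a single 4K-context request reads 512 MiB of KV per token, capping single-stream decode at a few thousand tokens/sec even at full HBM bandwidth.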
BATCHING — WHY IT HELPS GEMM BUT NOT KV CACHE
GEMM — AI SCALES WITH BATCH (sublinearly)
Weight W is K×N, input batch is B×K, output is B×N.
More tokens share the same weight bytes — but the input and output bytes also grow with B.
FLOPs = 2·B·K·N
bytes = s·(K·N + B·K + B·N)
AI = B·K·N / (K·N + B·(K+N)) → sublinear in B; saturates near the ridge
KV CACHE — AI CONSTANT IN BATCH
Each user has their own KV cache in separate HBM regions. Batching N users loads
N separate blobs. FLOPs and bytes both scale with N — the ratio stays fixed.
bytes = N · 2·L·d·B
FLOPs = N · 4·d·L
AI = 2/B (constant) → N cancels out entirely
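The N-cancellation can be checked directly. A sketch under the same per-head asymptotic model:

```python
def batched_kv_ai(n_users, ctx_len=4096, head_dim=128, bytes_per_elem=2):
    """Batched decode: each user streams a private KV blob, so N cancels."""
    flops = n_users * 4 * head_dim * ctx_len
    bytes_moved = n_users * 2 * head_dim * ctx_len * bytes_per_elem
    return flops / bytes_moved

print([batched_kv_ai(n) for n in (1, 16, 512)])   # → [1.0, 1.0, 1.0]
```

Every term scales linearly in N, so no amount of request batching moves decode attention toward the ridge.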
AI VS BATCH SIZE — GEMM 4K×4K FP16 vs FP16 KV (H100 ridge=295 F/B)
PREFIX CACHING — THREE QUANTITIES TO KEEP SEPARATE
When multiple requests share an identical prefix — same system prompt, same document context — that prefix KV
can be computed once and the blocks reused.
This affects three different quantities in different ways, and confusing them leads to bad offloading analysis later:
RESIDENT KV FOOTPRINT
Shared prefix blocks are stored once. Reduces how much HBM capacity the KV cache occupies.
This is supported by prefix-caching runtimes built on block KV management.
IDEALIZED DECODE READ TRAFFIC
In an idealized simultaneous-batch kernel, shared prefix bytes might be read once across all users.
This is a valid model to explore — but it depends on batching and runtime implementation details.
PER-REQUEST AI — UNCHANGED
For a single request, the per-head asymptotic baseline AI formula (2/B) is unchanged.
Prefix sharing does not alter the single-request baseline AI — though in an idealized batched kernel
with true shared-read semantics, whole-batch effective FLOPs-per-byte for the shared portion can rise.
TWO RELATED BUT DISTINCT MECHANISMS
Block/page KV management (PagedAttention in vLLM): partitions the KV cache into fixed-size blocks
stored in non-contiguous physical memory, reducing fragmentation and enabling flexible allocation.
This is primarily a memory management technique.
Automatic prefix caching (vLLM's APC, RadixAttention in SGLang): uses hash-based reuse
to skip recomputing KV blocks when a new request shares a prefix with a prior one.
This saves prefill compute and resident KV footprint for the shared prefix.
Whether it also reduces HBM read traffic per decode step depends on the batching strategy and kernel implementation.
A 2,000-token system prompt shared across 10,000 users would otherwise require 10,000 copies of the same KV data in HBM.
With prefix caching, only one copy is stored — a massive footprint win.
Prefix caching can also help a single later request that arrives alone — it reuses already-computed prompt KV
blocks without requiring a simultaneous batch.
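The 2,000-token / 10,000-user example can be made concrete. The layer count, KV-head count, and head dimension below are assumed values for a hypothetical Llama-like model, not figures from the text:

```python
def prefix_kv_bytes(tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                    bytes_per_elem=2):
    """KV-cache bytes for a prompt prefix: K and V per token, layer, KV head."""
    return tokens * n_layers * n_kv_heads * 2 * head_dim * bytes_per_elem

one = prefix_kv_bytes(2000)                # one stored copy of the prefix
print(f"one copy: {one / 2**20:.0f} MiB")
print(f"10,000 unshared copies: {10_000 * one / 2**40:.2f} TiB")
```

Under these assumptions the shared copy fits comfortably in HBM while the unshared version would need terabytes, which is the footprint win the text describes.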
↗ THE OFFLOADING EXTENSION — THREE QUANTITIES
Understanding KV cache at the roofline level maps directly onto offloading analysis.
There are three distinct quantities — confusing them produces bad trade-off reasoning:
RESIDENT KV FOOTPRINT
How much KV data must exist somewhere — HBM, DRAM, SSD, or a remote tier.
Offloading reduces HBM pressure by moving cold KV blocks elsewhere.
HBM DECODE TRAFFIC
How many K/V bytes must be read per decode step from HBM.
This is what the AI formula models. Compression and head-sharing reduce it directly.
OFFLOADED TRANSFER TRAFFIC
How many bytes must cross PCIe, CXL, NVLink, or storage links on the decode critical path.
Offloading solves footprint but can move the bottleneck here instead.
Recent KV-offloading work identifies PCIe transfer latency as the dominant cost in many offloaded workloads.
THE REAL LEVERS — WHAT ACTUALLY HELPS
① KV QUANTIZATION
INT8 halves bytes → doubles AI to 2 F/B.
INT4 quarters bytes → 4 F/B.
Still far from the ridge but real speedup — you can saturate HBM with fewer tokens, or serve more users with the same bandwidth.
② MLA (MULTI-HEAD LATENT ATTN)
DeepSeek's approach: compress K and V into a lower-dimensional latent vector before caching.
DeepSeek-V2 reports a 93.3% KV-cache reduction relative to standard MHA.
Fewer bytes stored and loaded → effective AI improves proportionally.
③ GQA / MQA
Grouped-query and multi-query attention reduce the number of KV heads (e.g. 32Q heads sharing 8 KV heads).
Fewer KV heads means fewer bytes in the cache.
Widely used in modern families: Llama 3 and Mistral use GQA; later Gemma-family models adopt it as well (original Gemma 7B used MHA; Gemma 2B used MQA).
④ PREFIX CACHING
Doesn't improve AI per kernel — but eliminates redundant HBM storage and recomputation across requests.
At scale this is often the largest practical win.
Automatic Prefix Caching in vLLM and RadixAttention in SGLang implement prefix reuse;
PagedAttention provides the block-based KV management substrate that makes this practical.
HARD TRUTH
Common practical schemes — INT8 KV, modest GQA (32Q/8KV) — can move whole-layer effective AI
into the low tens of F/B. More aggressive MLA-style compression can go higher in principle.
Real kernels land lower once fixed overheads, non-dominant bytes, and dequantization cost are included.
Either way, still well below the H100 dense BF16/FP16 theoretical ridge of ~295 F/B.
Decode attention is strongly bandwidth-limited in practice on current hardware.
The ridge is not a realistic target — it's a reminder of how far away compute-bound territory is,
and why memory bandwidth is the dominant cost lever.
06 / FLASHATTENTION
FlashAttention: same math, far fewer bytes through HBM
Naive attention materializes the full n×n score matrix S and softmax result P in HBM — quadratic in sequence length.
FlashAttention avoids this entirely by tiling the computation so that S and P are computed block-by-block in on-chip SRAM,
consumed immediately, and never written to HBM. The inputs (Q, K, V) and output (O) still stream through HBM in both approaches.
FlashAttention computes the same exact attention result and has the same asymptotic O(n²d) workload,
but practical FLOP counts can differ modestly between implementations — the paper itself shows cases
where FlashAttention has a slightly higher FLOP count yet is dramatically faster because HBM accesses drop so much.
The win is fewer HBM accesses and no full S/P materialization.
BEFORE
Naive Attention
⬛ HBM — full n×n attention scores live here
all n² scores stored in HBM · grows quadratically
◻ SRAM — used within each kernel, but S/P intermediates spill to HBM between steps
full n×n attention matrices written to HBM before the next kernel reads them
# three kernels, each writing its result to HBM
S = Q @ Kᵀ       → HBM
P = softmax(S)   → HBM
O = P @ V        → HBM
HBM: O(n²) — quadratic S/P intermediates · S, P materialized between kernels
→
AFTER
FlashAttention
✓ HBM — Q, K, V (inputs) + O (output)
Only Q/K/V/O reside off-chip — S/P intermediates are not materialized in HBM
⚡ SRAM — tiled attention (constant size)
■ in SRAM now · × done, discarded · □ not yet
# one pass, tile at a time
for q_tile in tile(Q):
    for kv_tile in tile(K, V):
        accumulate(O)   # stays in SRAM
HBM: O(n·d) — no S/P materialization · reduced HBM accesses vs naive
KEY INSIGHT
The n×n attention matrix is ephemeral — computed tile-by-tile in fast on-chip SRAM, consumed immediately, then thrown away. It never gets written to HBM.
FlashAttention computes the exact same attention result. The win is far fewer HBM accesses — the quadratic intermediates simply never touch off-chip memory.
TABLE A — TEMPORARY SCORE MATERIALIZATION IN HBM (per head, FP16)
CONTEXT · NAIVE S+P · FLASH S+P
1K · 4 MiB · not materialized
4K · 64 MiB · not materialized
16K · 1 GiB · not materialized
128K · 64 GiB · not materialized
Naive: S=QKᵀ and P=softmax(S) written to HBM · size = 4·n² bytes/head at FP16
Flash: S/P computed tile-by-tile in SRAM, never written to HBM
Both naive and Flash must stream Q, K, V, O — this traffic is identical.
Formula: 8·n·d bytes/head (Q+K+V+O) · d=128 · FP16
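Table A's entries follow from the two byte formulas in the notes above. A quick sketch reproducing them per head at FP16:

```python
def naive_sp_bytes(n, bytes_per_elem=2):
    """Naive attention: S = QKᵀ and P = softmax(S), each n×n, written to HBM."""
    return 2 * n * n * bytes_per_elem            # 4·n² bytes/head at FP16

def qkvo_bytes(n, d=128, bytes_per_elem=2):
    """Q, K, V, O streaming traffic: identical for naive and FlashAttention."""
    return 4 * n * d * bytes_per_elem            # 8·n·d bytes/head at FP16

for n in (1024, 4096, 16384, 131072):
    print(f"n={n:>6}: naive S+P = {naive_sp_bytes(n) / 2**20:,.0f} MiB, "
          f"QKVO = {qkvo_bytes(n) / 2**20:.1f} MiB")
```

The S/P term grows quadratically while QKVO grows linearly, so the fraction of traffic FlashAttention eliminates approaches 100% at long context.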
CONNECTION TO RIDGE POINT
FlashAttention's win is eliminating the quadratic S and P intermediates from HBM.
At 4K context with d=128, naive attention writes 64 MiB of S/P per head to HBM — bytes that don't exist at all in FlashAttention.
That is a substantial and real reduction in HBM traffic.
But both algorithms still read the same Q, K, V from HBM and write the same O back — about 4 MiB per head at 4K context.
FlashAttention doesn't change how many bytes the inputs and outputs require; it eliminates the wasteful intermediates.
That moves the workload's effective position on the roofline chart — more useful compute per byte actually moved.
The on-chip working SRAM is constant with respect to context length for a fixed tile configuration, but its exact
size depends on tile dimensions, head dimension, and kernel implementation — it is not a single universal constant.
07 / FULL ROOFLINE CHART
Read the roofline chart
Each line is one GPU's theoretical spec-sheet roofline using dense BF16/FP16 peaks — sustained measured throughput lies below these lines and is kernel-dependent.
The dot marks its ridge point. Everything left of that dot is memory-bound on that GPU. Everything right is compute-bound.
The arithmetic intensity [FLOP/byte] where your GPU transitions from bandwidth-bound to compute-bound. Computed as peak TFLOPS ÷ bandwidth (in TB/s) for the chosen compute ceiling — different precisions and kernel paths give different ridges on the same chip.
left of ridge
Memory-bound. Tensor cores are idle. Adding more FLOPs doesn't help — you need more bandwidth, or you need to move fewer bytes (compression, fusion).
right of ridge
Compute-bound. Bandwidth is underutilized. Adding bandwidth doesn't help — the FLOPs are the bottleneck.
at the ridge
Perfect balance. Both bandwidth and compute are fully utilized simultaneously. Almost never achieved in practice.
transformers
Mostly bandwidth-limited. Many common inference kernels — embedding gather (~0 F/B), streaming decode attention (~1–32 F/B depending on head sharing and dtype), GEMMs at batch=1 — sit far left of dense BF16/FP16 rooflines (~200–295 F/B). Large batched GEMMs are the main exception.