Pre-trained KV caches that skip prefill entirely. Train once per document, inject at serving time, eliminate the prefill bottleneck. Based on arxiv:2508.17032. 55 configurations validated across 3 architectures and 6 documents on H100 and W7900.
A cartridge replaces runtime prefill with a pre-trained KV cache. Instead of processing the document through the model at every request, the KV state is trained once offline and injected into GPU memory at serving time via a CPU→GPU memcpy.
Checkpoint size is dominated by KV head count. At p=256:
| Model | KV Heads | Checkpoint | At p=8192 |
|---|---|---|---|
| Qwen2.5-7B (GQA) | 4 | 15 MB | 470 MB |
| Qwen2.5-1.5B (GQA) | 2 | 7 MB | 224 MB |
| Llama-2-7b (MHA) | 32 | 128 MB | 4,856 MB |
GQA models produce 8.5x smaller cartridges than MHA models at the same prefix length. GQA is strongly preferred for cartridge deployment.
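Checkpoint size follows directly from KV geometry: 2 tensors (K and V) × layers × KV heads × head dim × p positions × 2 bytes for bf16. A back-of-envelope sketch (the layer counts and head dims below are the standard published configs, not stated in the table above; real checkpoints add serialization overhead and MB/MiB rounding, so estimates land within a few percent of the measured sizes):

```python
def cartridge_bytes(num_layers, num_kv_heads, head_dim, p, dtype_bytes=2):
    # One K and one V tensor per layer, each of shape [p, num_kv_heads, head_dim].
    return 2 * num_layers * num_kv_heads * head_dim * p * dtype_bytes

# Assumed standard configs: Qwen2.5-7B has 28 layers / 4 KV heads / head_dim 128;
# Qwen2.5-1.5B has 28 / 2 / 128; Llama-2-7b has 32 / 32 / 128.
for name, (layers, kv_heads, head_dim) in {
    "Qwen2.5-7B": (28, 4, 128),
    "Qwen2.5-1.5B": (28, 2, 128),
    "Llama-2-7b": (32, 32, 128),
}.items():
    mb = cartridge_bytes(layers, kv_heads, head_dim, p=256) / 1e6
    print(f"{name}: {mb:.1f} MB at p=256")
```

The MHA penalty is exactly the KV head ratio: 32 heads vs 4 heads is where the 8.5x gap comes from.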
Cartridge prefill time scales sub-linearly with prefix length because loading pre-computed KV (memcpy) is cheaper than running a forward pass over p tokens. Measured on Llama-3.2-1B-Instruct, AMD W7900, FlexAttention + torch.compile.
| Prefix tokens | ICL prefill | Cartridge prefill | Speedup |
|---|---|---|---|
| 256 | 0.037s | 0.038s | 1.0x |
| 512 | 0.050s | 0.040s | 1.3x |
| 1,024 | 0.072s | 0.041s | 1.8x |
| 2,048 | 0.138s | 0.048s | 2.9x |
| 4,096 | 0.312s | 0.071s | 4.4x |
| 8,192 | 0.758s | 0.110s | 6.9x |
| 16,384 | 2.027s | 0.192s | 10.6x |
At p=16,384: Cartridge achieves 44.5 tok/s vs ICL at 27.4 tok/s (1.6x end-to-end). The speedup comes entirely from prefill savings; decode throughput is identical.
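The 1.6x end-to-end figure is what the prefill savings alone predict. A sketch of the arithmetic, assuming a hypothetical ~128 output tokens at ~48 tok/s decode (both assumed, not measured; decode speed is identical for the two paths, so only the prefill term differs):

```python
def e2e_tps(n_out, prefill_s, decode_tps):
    # End-to-end throughput: output tokens over total request time.
    return n_out / (prefill_s + n_out / decode_tps)

# p=16,384 prefill times from the table above.
icl = e2e_tps(128, prefill_s=2.027, decode_tps=48.0)   # ~27 tok/s
cart = e2e_tps(128, prefill_s=0.192, decode_tps=48.0)  # ~45 tok/s
```

With these assumptions the model reproduces both measured throughputs and the ~1.6x ratio, consistent with the claim that the gain is pure prefill savings.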
On H100, prefill at 4K tokens takes only ~80µs of GPU time. TTFT is dominated by Python/HTTP overhead (~5ms), vLLM scheduling (~10ms), and network latency (~3ms). The cartridge advantage manifests with longer prefixes, on lower-end GPUs, or under higher concurrency.
LLM-as-judge: 10 factual questions per document, each answer scored 0–5. Instruct models self-judge; Llama-2-7b answers judged by Qwen-7B.
| Document | No prefix | Full | 4x comp. | 16x comp. | p=256 |
|---|---|---|---|---|---|
| ML Research (8K) | 1.2 | 4.5 | 3.3 | 3.4 | 3.1 |
| Llama 2 paper (20K) | 2.7 | 3.7 | 2.9 | 2.1 | 2.5 |
| GPL v3 (7K) | 2.9 | 3.8 | 2.8 | 2.3 | 2.4 |
| Clinical (12K) | 2.5 | 3.9 | 4.1 | 3.7 | 3.3 |
| RFC 9110 (49K) | 2.5 | — | 4.1 | 3.9 | 3.7 |
| Wikipedia (2K) | 3.6 | 4.4 | 3.8 | 4.0 | 3.7 |
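For concreteness, the judging loop reduces to a prompt template plus a score parser. A hedged sketch (the exact prompt wording is an assumption; only the 0–5 scale and 10-questions-per-document setup come from the description above):

```python
import re

def build_judge_prompt(question, answer):
    # Hypothetical wording; the eval's actual prompt is not shown in this doc.
    return (
        "Score the following answer from 0 (wrong) to 5 (fully correct).\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a single integer."
    )

def parse_judge_score(reply):
    # Take the first standalone digit 0-5 in the judge model's reply.
    m = re.search(r"\b([0-5])\b", reply)
    if m is None:
        raise ValueError(f"no score found in {reply!r}")
    return int(m.group(1))
```

Per-document scores in the table are then the mean over the 10 parsed answers.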
Instruct tuning is critical. At p=256, instruct models score 2.4–4.0 while the base model scores 0.0–3.2 with high variance.
Model size matters less than expected. Qwen-1.5B (1.5B params) often matches Qwen-7B (7B params), suggesting cartridge training adapts well to smaller models.
Cartridges build on Prefix Tuning (Li and Liang, 2021). The core idea: prepend p trainable continuous vectors to the key and value matrices at every attention layer. These vectors are optimized via gradient descent while the model weights stay frozen.
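Mechanically, prefix tuning is just concatenation at attention time. A toy single-head, single-query sketch in plain Python (all numbers are illustrative, not from the paper):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(q, keys, values):
    # Scaled dot-product attention for a single query vector.
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# KV the frozen model produced for two document tokens.
doc_k = [[0.1, 0.2], [0.3, -0.1]]
doc_v = [[1.0, 0.0], [0.0, 1.0]]

# p=2 trainable prefix KV vectors: the only parameters gradient
# descent touches; all model weights stay frozen.
prefix_k = [[0.5, 0.5], [-0.2, 0.4]]
prefix_v = [[0.7, 0.3], [0.2, 0.8]]

q = [0.4, -0.3]
out_plain = attend(q, doc_k, doc_v)
out_prefixed = attend(q, prefix_k + doc_k, prefix_v + doc_v)
```

Prepending `prefix_k`/`prefix_v` changes every query's attention distribution without touching a single weight; training adjusts only those p vectors per layer.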
Li and Liang found that initialization strongly affects final quality:
| Strategy | Description | Quality |
|---|---|---|
| Random Gaussian | Sample from N(0, 0.02) | Worst — optimizer starts far from viable KV space |
| Vocabulary embeddings | Sample from model's word embedding matrix | Better — starts in a region the model recognizes |
| Real activations (First-k) | Run the actual document through the model, take first p KV vectors | Best — warm start in the exact KV space the model produces |
First-k initialization (used by cartridges) is the natural extension: the initial KV state comes from actually processing the document, giving the optimizer a warm start in a region of KV space the model already understands. The optimizer then refines these vectors to compress the full document's information into p positions.
Original prefix tuning was a fine-tuning technique: train task-specific prefixes to adapt a frozen model. Cartridges repurpose it as an inference optimization: the trained prefix is the document representation, and injecting it replaces prefill entirely.
SCI (Sampled Chunk Initialization) samples random 64-token chunks from across the document instead of taking the first p tokens. In theory, this captures broader content. In practice, both strategies produce nearly identical results:
| Prefix | First-k train (s) | SCI train (s) | E2E First-k (tok/s) | E2E SCI (tok/s) |
|---|---|---|---|---|
| 256 | 661 | 682 | 59.5 | 60.0 |
| 4,096 | 1,144 | 1,131 | 56.5 | 55.9 |
| 16,384 | 2,684 | 2,610 | 44.5 | 44.2 |
SCI's broader initialization does not translate into measurable quality or performance advantage over First-k.
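The chunk-sampling step of SCI can be sketched in a few lines. A plain-Python sketch of index selection only (the real implementation operates on per-layer KV tensors; `chunk=64` matches the 64-token chunks described above, and the seed/dedup details are assumptions):

```python
import random

def sci_init_indices(doc_len, p, chunk=64, seed=0):
    # Sampled Chunk Initialization: gather random 64-token chunks from
    # across the document until p distinct positions are collected.
    assert p <= doc_len
    rng = random.Random(seed)
    picked, seen = [], set()
    while len(picked) < p:
        start = rng.randrange(0, max(1, doc_len - chunk))
        for i in range(start, start + chunk):
            if i not in seen:
                seen.add(i)
                picked.append(i)
    return sorted(picked[:p])

def first_k_indices(doc_len, p):
    # First-k: simply the first p token positions.
    return list(range(min(p, doc_len)))
```

Both functions return p KV positions to initialize from; per the results above, the extra spread SCI buys does not change quality or throughput.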
CartridgeConnector plugs into vLLM v0.16.0 as a KVConnectorBase_V1 plugin with zero vLLM core modifications. About 300 lines of Python.
| Requirement | Value |
|---|---|
| vLLM version | ≥ 0.16.0 (KVConnectorBase_V1 API) |
| Artifacts | cartridge.pt + prefix_token_ids.json |
| Core patches | None |
| Connector size | ~300 lines Python |
```bash
export CARTRIDGE=/path/to/cartridge.pt
export PREFIX_IDS=/path/to/prefix_token_ids.json

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.40 \
  --kv-transfer-config '{
    "kv_connector": "CartridgeConnector",
    "kv_connector_module_path": "vllm.distributed.kv_transfer.kv_connector.v1.cartridge_connector",
    "kv_connector_extra_config": {
      "cartridge_path": "'"$CARTRIDGE"'",
      "prefix_token_ids_path": "'"$PREFIX_IDS"'"
    },
    "kv_role": "kv_both"
  }'
```
```python
class CartridgeConnector(KVConnectorBase_V1):
    def __init__(self, rank, local_rank, config):
        self.cartridge = torch.load(config.cartridge_path, map_location="cpu")
        self.prefix_token_ids = load_json(config.prefix_token_ids_path)
        self.layer_k = [layer["k"].contiguous() for layer in self.cartridge]
        self.layer_v = [layer["v"].contiguous() for layer in self.cartridge]

    def get_num_new_matched_tokens(self, request, **kwargs):
        # Compare request tokens against stored prefix, block-aligned
        matched = longest_common_prefix(
            request.prompt_token_ids, self.prefix_token_ids
        )
        return align_down_to_block_size(matched, block_size=16)

    def start_load_kv(self, connector_meta, **kwargs):
        # Scatter pre-trained KV from CPU to vLLM-assigned GPU slots
        slot_mapping = connector_meta["slot_mapping"]
        matched = connector_meta["matched_tokens"]
        for layer_idx in range(len(self.layer_k)):
            scatter_to_gpu_blocks(
                self.layer_k[layer_idx][:matched],
                self.layer_v[layer_idx][:matched],
                slot_mapping[:matched],
            )
```
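The two helpers the connector leans on are not shown above; plain-Python versions might look like this (the `block_size=16` default matches vLLM's standard KV block size):

```python
def longest_common_prefix(a, b):
    # Number of leading token ids the request shares with the stored prefix.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def align_down_to_block_size(n, block_size=16):
    # vLLM allocates KV cache in fixed-size blocks; only whole blocks
    # can be reported as externally matched, so round down.
    return (n // block_size) * block_size
```

Rounding down means a request matching, say, 37 prefix tokens injects 32 and recomputes the remainder in the normal prefill path.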
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "<full training document text>"},
      {"role": "user", "content": "What is the main contribution?"}
    ]
  }'
```
| Property | Qwen-7B | Qwen-1.5B | Llama-2-7B |
|---|---|---|---|
| Parameters | 7B | 1.5B | 7B |
| KV Heads | 4 (GQA) | 2 (GQA) | 32 (MHA) |
| p=256 checkpoint | 15 MB | 7 MB | 128 MB |
| p=8192 checkpoint | 470 MB | 224 MB | 4,856 MB |
| Best full score | 4.5 | 4.5 | 4.3 |
| Avg p=256 score | 3.1 | 3.5 | 1.6 |
GQA models produce dramatically smaller cartridges: 8.5x smaller than MHA at the same prefix length. GQA is strongly preferred for cartridge deployment.
Instruct tuning is critical for compression. Base Llama-2-7b scores 0.0–3.2 at 4x+ compression vs 2.4–4.1 for instruct models.
Small instruct models are surprisingly effective. Qwen-1.5B (1.5B params) often matches or exceeds Qwen-7B (7B params) at compressed cartridge sizes.
| Prefix tokens | TTFT, conc=1 | TTFT, conc=4 | TTFT, conc=16 | Output TPS (conc=1) |
|---|---|---|---|---|
| 4,096 | 31ms | 43ms | 75ms | 166 |
| 8,192 | 41ms | 69ms | 127ms | 163 |
| 16,384 | 78ms | 127ms | 245ms | 158 |
| 32,768 | 56ms | 136ms | 261ms | 158 |