Pre-trained KV caches that skip prefill entirely. Train once per document, inject at serving time, eliminate the prefill bottleneck. Based on arxiv:2508.17032. 55 configurations validated across 3 architectures and 6 documents on H100 and W7900. Production connector with multi-cartridge routing, GPU residency, LMCache tiering, and SQLite registry.
A cartridge replaces runtime prefill with a pre-trained KV cache. Instead of processing a document through the model at every request, the KV state is trained once offline and injected directly into GPU memory as a CPU→GPU memcpy.
Checkpoint size is dominated by KV head count. At p=256:
| Model | KV heads | Checkpoint (p=256) | Checkpoint (p=8192) |
|---|---|---|---|
| Qwen2.5-7B (GQA) | 4 | 15 MB | 470 MB |
| Qwen2.5-1.5B (GQA) | 2 | 7 MB | 224 MB |
| Llama-2-7b (MHA) | 32 | 128 MB | 4,856 MB |
Qwen2.5-7B (GQA, 4 KV heads) produces a cartridge about 8.5x smaller than Llama-2-7b (MHA, 32 KV heads) at p=256 (15 MB vs 128 MB). GQA is strongly preferred for cartridge deployment.
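The dependence on KV head count can be checked with a back-of-the-envelope estimate. The sketch below assumes a checkpoint holds one key and one value tensor per layer in bf16 (2 bytes per element) and nothing else. The layer counts and head dimension are taken from the public model configs, not from this page, so treat them as assumptions. The estimate lands close to the p=256 figures above but is not expected to reproduce every reported number, since it ignores serialization overhead and any extra tensors.

```python
def kv_checkpoint_bytes(num_layers, num_kv_heads, head_dim, prefix_len, bytes_per_elem=2):
    """Bytes for one K and one V tensor per layer, each [kv_heads, prefix_len, head_dim]."""
    return 2 * num_layers * num_kv_heads * head_dim * prefix_len * bytes_per_elem


# Assumed architecture values: (layers, KV heads, head_dim).
MODELS = {
    "Qwen2.5-7B": (28, 4, 128),
    "Qwen2.5-1.5B": (28, 2, 128),
    "Llama-2-7b": (32, 32, 128),
}

for name, (layers, kv_heads, head_dim) in MODELS.items():
    size = kv_checkpoint_bytes(layers, kv_heads, head_dim, prefix_len=256)
    print(f"{name}: {size / 1e6:.1f} MB ({size / 2**20:.1f} MiB)")
```

Because the size is linear in both KV head count and prefix length, halving the KV heads halves the checkpoint at any p.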
Cartridge prefill time scales sub-linearly with prefix length because loading pre-computed KV (memcpy) is cheaper than running a forward pass over p tokens. Measured on Llama-3.2-1B-Instruct, AMD W7900, FlexAttention + torch.compile.
| Prefix tokens | ICL prefill | Cartridge prefill | Speedup |
|---|---|---|---|
| 256 | 0.037s | 0.038s | 1.0x |
| 512 | 0.050s | 0.040s | 1.3x |
| 1,024 | 0.072s | 0.041s | 1.8x |
| 2,048 | 0.138s | 0.048s | 2.9x |
| 4,096 | 0.312s | 0.071s | 4.4x |
| 8,192 | 0.758s | 0.110s | 6.9x |
| 16,384 | 2.027s | 0.192s | 10.6x |
At p=16,384: Cartridge achieves 44.5 tok/s vs ICL at 27.4 tok/s (1.6x end-to-end). The speedup comes entirely from prefill savings; decode throughput is identical.
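To see why a 10.6x prefill speedup becomes a much smaller end-to-end gain, here is a minimal model in which prefill is paid once and decode cost is paid per generated token. The prefill times come from the p=16,384 row above. The output length and per-token decode time are hypothetical values chosen only for illustration; they are not measurements from this project.

```python
def prefill_speedup(icl_prefill_s, cartridge_prefill_s):
    """Ratio of ICL prefill time to cartridge prefill time."""
    return icl_prefill_s / cartridge_prefill_s


def end_to_end_tps(prefill_s, decode_s_per_token, n_tokens):
    """Tokens per second when prefill is paid once and decode is paid per token."""
    return n_tokens / (prefill_s + decode_s_per_token * n_tokens)


# Prefill times from the p=16,384 row of the table.
ICL_PREFILL_S, CARTRIDGE_PREFILL_S = 2.027, 0.192

# Hypothetical decode settings (assumptions, not measured values).
N_TOKENS, DECODE_S_PER_TOKEN = 128, 0.021

icl_tps = end_to_end_tps(ICL_PREFILL_S, DECODE_S_PER_TOKEN, N_TOKENS)
cart_tps = end_to_end_tps(CARTRIDGE_PREFILL_S, DECODE_S_PER_TOKEN, N_TOKENS)

print(f"prefill speedup: {prefill_speedup(ICL_PREFILL_S, CARTRIDGE_PREFILL_S):.1f}x")
print(f"end-to-end speedup: {cart_tps / icl_tps:.2f}x")
```

The longer the generation, the more the identical decode cost dilutes the prefill advantage.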
On H100, prefill at 4K tokens takes only ~80µs of GPU time. TTFT is dominated by Python/HTTP overhead (~5ms), vLLM scheduling (~10ms), and network latency (~3ms). The cartridge advantage manifests at longer prefixes, lower-end GPUs, or higher concurrency.
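The arithmetic behind that statement is short enough to spell out. This sketch simply sums the approximate figures quoted above and reports how small the GPU prefill share is; the component names are labels chosen here for illustration.

```python
# Approximate per-request costs on H100 at 4K tokens, in milliseconds (from the text above).
TTFT_COMPONENTS_MS = {
    "gpu_prefill": 0.080,      # ~80 microseconds
    "python_http": 5.0,
    "vllm_scheduling": 10.0,
    "network": 3.0,
}

total_ms = sum(TTFT_COMPONENTS_MS.values())
gpu_share = TTFT_COMPONENTS_MS["gpu_prefill"] / total_ms

print(f"TTFT is roughly {total_ms:.2f} ms; GPU prefill share is {gpu_share:.1%}")
```

With the GPU share this far below 1%, removing prefill cannot move TTFT much at this prefix length on this hardware.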
LLM-as-judge: 10 factual questions per document, each answer scored 0–5. Instruct models self-judge; Llama-2-7b answers judged by Qwen-7B.
| Document | No prefix | Full | 4x comp. | 16x comp. | p=256 |
|---|---|---|---|---|---|
| ML Research (8K) | 1.2 | 4.5 | 3.3 | 3.4 | 3.1 |
| Llama 2 paper (20K) | 2.7 | 3.7 | 2.9 | 2.1 | 2.5 |
| GPL v3 (7K) | 2.9 | 3.8 | 2.8 | 2.3 | 2.4 |
| Clinical (12K) | 2.5 | 3.9 | 4.1 | 3.7 | 3.3 |
| RFC 9110 (49K) | 2.5 | — | 4.1 | 3.9 | 3.7 |
| Wikipedia (2K) | 3.6 | 4.4 | 3.8 | 4.0 | 3.7 |
Instruct tuning is critical. At p=256, instruct models score 2.4–4.0 while the base model scores 0.0–3.2 with high variance.
Model size matters less than expected. Qwen-1.5B (1.5B params) often matches Qwen-7B (7B params), suggesting cartridge training adapts well to smaller models.
Cartridges build on Prefix Tuning (Li and Liang, 2021). The core idea: prepend p trainable continuous vectors to the key and value matrices at every attention layer. These vectors are optimized via gradient descent while the model weights stay frozen.
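As a toy illustration of what "prepend p vectors to the key and value matrices" means, the sketch below runs single-head attention for one query in plain Python. It is a simplified stand-in for a real attention layer: there is no batching, no multi-head logic, and no training loop, and the function name is hypothetical.

```python
import math


def attend_with_prefix(query, token_keys, token_values, prefix_keys, prefix_values):
    """Single-head attention where prefix K/V rows are prepended to the token K/V rows."""
    keys = prefix_keys + token_keys          # prefix positions come first
    values = prefix_values + token_values
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]

    # Numerically stable softmax over all p + n positions.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


# Toy example: p = 2 prefix positions, n = 1 real token, head_dim = 2.
prefix_k = [[1.0, 0.0], [0.0, 1.0]]
prefix_v = [[1.0, 1.0], [2.0, 2.0]]
token_k = [[0.5, 0.5]]
token_v = [[3.0, 3.0]]

out, w = attend_with_prefix([1.0, 0.0], token_k, token_v, prefix_k, prefix_v)
print(len(w), round(sum(w), 6))
```

In prefix tuning only `prefix_k` and `prefix_v` would receive gradients; the projections that produce the query and token K/V stay frozen.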
Li and Liang found that initialization strongly affects final quality:
| Strategy | Description | Quality |
|---|---|---|
| Random Gaussian | Sample from N(0, 0.02) | Worst — optimizer starts far from viable KV space |
| Vocabulary embeddings | Sample from model's word embedding matrix | Better — starts in a region the model recognizes |
| Real activations (First-k) | Run the actual document through the model, take first p KV vectors | Best — warm start in the exact KV space the model produces |
First-k initialization (used by cartridges) is the natural extension: the initial KV state comes from actually processing the document, giving the optimizer a warm start in a region of KV space the model already understands. The optimizer then refines these vectors to compress the full document's information into p positions.
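A minimal sketch of the slicing step, assuming the document's KV state is already available as one (keys, values) pair per layer with positions along the first axis. The nested-list layout and the function name are illustrative assumptions; a real implementation would slice tensors and clone them before training.

```python
def first_k_init(kv_per_layer, p):
    """Keep the first p positions of each layer's (keys, values) pair."""
    init = []
    for keys, values in kv_per_layer:
        if len(keys) < p:
            raise ValueError("document is shorter than the requested prefix length")
        init.append((keys[:p], values[:p]))
    return init


# Toy cache: 2 layers, 5 positions, each position a 1-element vector.
toy_cache = [
    ([[float(i)] for i in range(5)], [[float(-i)] for i in range(5)])
    for _ in range(2)
]

prefix = first_k_init(toy_cache, p=3)
print(len(prefix), len(prefix[0][0]))
```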
Original prefix tuning was a fine-tuning technique: train task-specific prefixes to adapt a frozen model. Cartridges repurpose it as an inference optimization: the trained prefix is the document representation, and injecting it replaces prefill entirely.
SCI (Sampled Chunk Initialization) samples random 64-token chunks from across the document instead of taking the first p tokens. In theory, this captures broader content. In practice, both strategies produce nearly identical results:
| Prefix | First-k (s) | SCI (s) | E2E First-k (tok/s) | E2E SCI (tok/s) |
|---|---|---|---|---|
| 256 | 661 | 682 | 59.5 | 60.0 |
| 4,096 | 1,144 | 1,131 | 56.5 | 55.9 |
| 16,384 | 2,684 | 2,610 | 44.5 | 44.2 |
SCI's broader initialization does not translate into measurable quality or performance advantage over First-k.
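The difference between the two strategies comes down to which token positions seed the prefix. The sketch below is one plausible reading of SCI: the document is split into aligned 64-token slots and distinct slots are sampled at random. Slot alignment, non-overlap, and the fixed seed are assumptions made here to keep the example simple; the source only says that random 64-token chunks are sampled from across the document.

```python
import random


def first_k_positions(p):
    """First-k: the first p token positions."""
    return list(range(p))


def sci_positions(doc_len, p, chunk=64, seed=0):
    """SCI sketch: positions drawn from randomly chosen, non-overlapping chunk slots."""
    n_chunks = -(-p // chunk)        # ceil(p / chunk)
    n_slots = doc_len // chunk       # aligned slots that fit in the document
    if n_chunks > n_slots:
        raise ValueError("document too short for the requested prefix length")
    rng = random.Random(seed)
    starts = sorted(rng.sample(range(n_slots), n_chunks))
    positions = [s * chunk + i for s in starts for i in range(chunk)]
    return positions[:p]


pos = sci_positions(doc_len=8192, p=256)
print(len(pos), pos[0], pos[-1])
```

Either position list can then be used to pick the KV rows that seed training.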
CartridgeConnector is a KVConnectorBase_V1 plugin built as a production stack of seven modules (~2,800 lines). It supports single-cartridge (backward-compatible) and multi-cartridge (per-request dispatch) modes with no vLLM core modifications.
| Module | Responsibility |
|---|---|
| CartridgeConnector | KVConnectorBase_V1 plugin. Hooks into the V1 scheduler via get_num_new_matched_tokens() and materialises KV via triton_reshape_and_cache_flash. |
| CartridgeManifest | Per-cartridge metadata + strict format validation. Names the model, layout, quantisation, and provenance. |
| CartridgeRegistry | SQLite-backed manifest index. Registry errors fall back to full prefill; a bad cartridge never poisons the serving path. |
| CartridgeStore | Read-only chunked KV access with residency bookkeeping. Payload stored separately so manifests can be scanned cheaply. |
| CartridgeGPUResidency | LRU eviction + pinning across a bounded GPU budget (default 8× largest cartridge or 1 GiB). Hot cartridges stay resident; cold ones spill. |
| CartridgeLMCachePlugin | Implements the LMCache StoragePluginInterface (read-only, min LMCache v0.3.13). Cartridges flow through the existing tiered KV offload without patches. |
| CartridgeRouter | Per-request dispatch. Explicit ids, named labels, static mapping, and composite chains. Carries cartridge_id end-to-end through connector metadata to the worker. |
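The "registry errors fall back to full prefill" behaviour described for CartridgeRegistry can be sketched with the standard library `sqlite3` module. The schema, column names, and function names below are hypothetical and are not taken from the actual implementation; the point is only that a lookup returns `None` on any failure so that the caller can run a normal prefill.

```python
import sqlite3


def open_registry(path=":memory:"):
    """Open (or create) a registry database with a hypothetical single-table schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cartridges ("
        "cartridge_id TEXT PRIMARY KEY, model TEXT NOT NULL, path TEXT NOT NULL)"
    )
    return conn


def lookup_cartridge(conn, cartridge_id, model):
    """Return the payload path, or None to signal 'fall back to full prefill'."""
    try:
        row = conn.execute(
            "SELECT path FROM cartridges WHERE cartridge_id = ? AND model = ?",
            (cartridge_id, model),
        ).fetchone()
    except sqlite3.Error:
        return None  # a broken registry must not break the serving path
    return row[0] if row else None


MODEL = "Qwen/Qwen2.5-7B-Instruct"
conn = open_registry()
conn.execute(
    "INSERT INTO cartridges VALUES (?, ?, ?)",
    ("medical", MODEL, "/data/cartridges/medical.pt"),
)

print(lookup_cartridge(conn, "medical", MODEL))
print(lookup_cartridge(conn, "unknown", MODEL))
```

Matching on the model name as well as the id keeps a cartridge trained for one model from being served with another.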
vllm serve Qwen/Qwen2.5-7B-Instruct \
--kv-transfer-config '{
"kv_connector": "CartridgeConnector",
"kv_connector_module_path":
"vllm.distributed.kv_transfer.kv_connector.v1.cartridge_connector",
"kv_connector_extra_config": {
"cartridge_path": "/path/to/cartridge.pt"
},
"kv_role": "kv_both"
}'
vllm serve Qwen/Qwen2.5-7B-Instruct \
--kv-transfer-config '{
"kv_connector": "CartridgeConnector",
"kv_connector_module_path":
"vllm.distributed.kv_transfer.kv_connector.v1.cartridge_connector",
"kv_connector_extra_config": {
"cartridges": {
"medical": "/data/cartridges/medical.pt",
"legal": "/data/cartridges/legal.pt",
"clinical": "/data/cartridges/clinical.pt"
},
"router": {"type": "label", "field": "cartridge_id"}
},
"kv_role": "kv_both"
}'
Two concurrent requests can inject different cartridges into different allocated blocks without cross-contamination. The router carries cartridge_id end-to-end from the scheduler to the worker side of the connector.
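Conceptually, the label router configured above is a per-request lookup from a field value to a cartridge. The sketch below models a request as a plain dict and is not the connector's real interface; `make_label_router` is a hypothetical helper, and treating an unknown label as "no cartridge" is an assumption.

```python
def make_label_router(cartridges, field="cartridge_id"):
    """Build a router that maps a request's label field to a cartridge path."""

    def route(request):
        label = request.get(field)
        # None means no cartridge applies, so the request takes the normal prefill path.
        return cartridges.get(label)

    return route


cartridges = {
    "medical": "/data/cartridges/medical.pt",
    "legal": "/data/cartridges/legal.pt",
    "clinical": "/data/cartridges/clinical.pt",
}
route = make_label_router(cartridges)

# Two concurrent requests resolve independently to different cartridges.
print(route({"cartridge_id": "medical"}))
print(route({"cartridge_id": "legal"}))
```

Because the router is a pure function of the request, two in-flight requests cannot influence each other's cartridge choice.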
| Phase | What it tests | Result |
|---|---|---|
| Phase 0 | Single-cartridge load + KV injection smoke | PASS |
| Phase 0.5 | LMCache StoragePluginInterface integration | PASS |
| Phase 1 | Multi-cartridge routing (3 cartridges, same model) | PASS |
| Phase 2 | TTFT benchmark: cartridge vs dense prefill | 7.0x p50 speedup |
| Phase 3 | Multi-cartridge routing dispatch (20 cartridges) | PASS · 57% hit at 5/20 capacity |
| Phase 4 | GPU residency stress under pressure | PASS |
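The residency behaviour exercised in Phases 3 and 4 (a bounded budget, LRU eviction, pinning, and a hit rate below 100% when capacity is smaller than the working set) can be sketched with an `OrderedDict`. The class below is an illustrative model only, not the CartridgeGPUResidency implementation: it tracks sizes rather than real GPU buffers, and its name and methods are hypothetical.

```python
from collections import OrderedDict


class LRUResidency:
    """Toy LRU tracker over a byte budget, with pinning."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.resident = OrderedDict()  # cartridge_id -> size, least recently used first
        self.pinned = set()
        self.hits = 0
        self.misses = 0

    def pin(self, cartridge_id):
        self.pinned.add(cartridge_id)

    def unpin(self, cartridge_id):
        self.pinned.discard(cartridge_id)

    def touch(self, cartridge_id, size):
        """Record an access; True on a hit, False if the cartridge had to be loaded."""
        if cartridge_id in self.resident:
            self.resident.move_to_end(cartridge_id)
            self.hits += 1
            return True
        self.misses += 1
        self._evict_until_fits(size)
        self.resident[cartridge_id] = size
        self.used += size
        return False

    def _evict_until_fits(self, size):
        for victim in list(self.resident):  # oldest first
            if self.used + size <= self.budget:
                break
            if victim in self.pinned:
                continue  # pinned cartridges are never evicted
            self.used -= self.resident.pop(victim)
        if self.used + size > self.budget:
            raise MemoryError("pinned cartridges leave no room for a new one")


cache = LRUResidency(budget_bytes=2)
for cid in ["a", "b", "a", "c"]:
    cache.touch(cid, size=1)
print(list(cache.resident), cache.hits, cache.misses)
```

Replaying a request trace through `touch` gives a quick estimate of the hit rate for a given budget before committing GPU memory to it.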
# Minimal build for cartridge-only serving on W7900 / gfx1100
VLLM_BUILD_PROFILE=cartridges_rocm pip install -e .
Builds only the extensions a cartridge-serving path needs (vllm._C, vllm._rocm_C, cumem_allocator). Includes a gfx1100 sampler thread-count fix and a gfx1x skinny_gemm wave32 workaround.
| Property | Qwen-7B | Qwen-1.5B | Llama-2-7B |
|---|---|---|---|
| Parameters | 7B | 1.5B | 7B |
| KV Heads | 4 (GQA) | 2 (GQA) | 32 (MHA) |
| p=256 checkpoint | 15 MB | 7 MB | 128 MB |
| p=8192 checkpoint | 470 MB | 224 MB | 4,856 MB |
| Best full score | 4.5 | 4.5 | 4.3 |
| Avg p=256 score | 3.1 | 3.5 | 1.6 |
GQA models produce dramatically smaller cartridges: about 8.5x smaller than MHA at p=256 (15 MB for Qwen-7B vs 128 MB for Llama-2-7B). GQA is strongly preferred.
Instruct tuning is critical for compression. Base Llama-2-7b scores 0.0–3.2 at 4x+ compression vs 2.4–4.1 for instruct models.
Small instruct models are surprisingly effective. Qwen-1.5B (1.5B params) often matches or exceeds Qwen-7B (7B params) at compressed cartridge sizes.
| Prefix | Conc=1 | Conc=4 | Conc=16 | Output TPS (c=1) |
|---|---|---|---|---|
| 4,096 | 31ms | 43ms | 75ms | 166 |
| 8,192 | 41ms | 69ms | 127ms | 163 |
| 16,384 | 78ms | 127ms | 245ms | 158 |
| 32,768 | 56ms | 136ms | 261ms | 158 |
Full validation of the 20260429-cartridges-code-only branch. Validates cartridge KV injection, per-request dispatch, manifest validation, GPU residency, and serving correctness.
| Tier | Test | Method | Result |
|---|---|---|---|
| -1 | Unit / static | pytest tests/v1/core/test_cartridge*.py | 178/178 PASS |
| 0 | Single-cart smoke | vllm serve + CartridgeConnector + 10 requests | PASS |
| 0.5 | Prefill canary | HF model.forward() vs cartridge KV — top-1 token match, logit diff 0.16 | PASS |
| 1 | Context verification | 10 factual medical questions via chat API | 10/10 PASS |
| 1b | Generated token logprobs | /v1/completions with logprobs=1 | PASS |
| 2 | 16 concurrent requests | 8-thread pool, same cartridge | 16/16 PASS |
| 3 | Multi-cart dispatch | 3 carts, ExplicitCartridgeRouter, per-request routing | PASS |
| 3 (cross) | No cross-contamination | Ask cart_a about cart_b data | PASS |
| 3b | Qwen cross-model | Qwen3-14B + Qwen cartridge (server starts, cart loads) | PASS |
| 4 | 50 sequential stability | Server alive after 50 requests | 50/50 PASS |
| 5 | Error handling | Empty prompt, short prompt | PASS |
| 6 | TP=2 smoke | Llama + cartridge on 2×H100 — TP KV head sharding fixed (commit 99ee4e9) | PASS (sharding fixed) |
| 6 | bf16 dtype | Default in all tiers | PASS |
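The Tier 0.5 canary compares next-token logits from a reference forward pass against logits obtained from cartridge KV. A minimal version of that comparison is sketched below on plain lists. The metric (maximum absolute difference) and the 0.5 threshold are assumptions for illustration; the table reports only a top-1 match and a logit diff of 0.16 without stating how the diff is defined.

```python
def canary_check(reference_logits, cartridge_logits, max_abs_diff=0.5):
    """Compare two next-token logit vectors: top-1 agreement and largest absolute gap."""
    if len(reference_logits) != len(cartridge_logits):
        raise ValueError("vocabulary sizes differ")
    top_ref = max(range(len(reference_logits)), key=reference_logits.__getitem__)
    top_cart = max(range(len(cartridge_logits)), key=cartridge_logits.__getitem__)
    diff = max(abs(a - b) for a, b in zip(reference_logits, cartridge_logits))
    top1_match = top_ref == top_cart
    return {
        "top1_match": top1_match,
        "max_abs_diff": diff,
        "passed": top1_match and diff <= max_abs_diff,  # hypothetical pass criterion
    }


result = canary_check([0.1, 2.0, -1.0], [0.2, 1.9, -1.1])
print(result["top1_match"], result["passed"])
```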
| Bug | Severity | Fix | Commit |
|---|---|---|---|
| Stale triton_reshape_and_cache_flash import | Build | ops.reshape_and_cache_flash | 43714e7 |
| kv_cache_config kwarg not in base class | Init | Remove from __init__ | 1059e66 |
| _kv_transfer_config not set by base | Init | Set from vllm_config | 1059e66 |
| All prompt tokens claimed as external | Runtime | Cap at len(prompt_ids) - 1 | 43a24bb |
| LogprobsTensors OverflowError | Runtime | torch.zeros instead of torch.empty | 0211fb4 |
| TP>1 KV head sharding | Runtime | Slice cart KV heads to [tp_rank*H/tp : (tp_rank+1)*H/tp] | 99ee4e9 |
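Two of the runtime fixes in the table are small enough to express directly. The sketch below shows the token cap and the per-rank KV head slice as standalone helpers. The function names are hypothetical, and the reasoning in the comments (that at least one prompt token must still be computed locally) is an assumption about why the cap exists.

```python
def cap_external_tokens(num_prompt_tokens, cartridge_tokens):
    """Never claim the whole prompt as external; leave at least one token to compute."""
    return max(0, min(cartridge_tokens, num_prompt_tokens - 1))


def kv_head_slice(num_kv_heads, tp_size, tp_rank):
    """Slice of KV heads owned by one tensor-parallel rank: [rank*H/tp, (rank+1)*H/tp)."""
    if num_kv_heads % tp_size != 0:
        raise ValueError("KV heads must divide evenly across tensor-parallel ranks")
    per_rank = num_kv_heads // tp_size
    return slice(tp_rank * per_rank, (tp_rank + 1) * per_rank)


heads = list(range(8))  # stand-in for the KV head axis of a cartridge tensor
print(cap_external_tokens(num_prompt_tokens=10, cartridge_tokens=10))
print(heads[kv_head_slice(num_kv_heads=8, tp_size=2, tp_rank=1)])
```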
All tests run CPU-only via pytest tests/v1/core/test_cartridge*.py. No GPU required.
| File | Tests | What it covers |
|---|---|---|
| test_cartridge_connector.py | CPU fake-writer, slot mapping, KV injection shapes | Core inject path correctness |
| test_cartridge_connector_integration.py | Scheduler API: get_num_new_matched_tokens, update_state_after_alloc, build_connector_meta | State machine correctness |
| test_cartridge_manifest.py | JSON load, model/shape/checksum validation | Manifest safety checks |
| test_cartridge_registry.py | SQLite insert/get/list/label lookup | Registry operations |
| test_cartridge_store.py | Load/get/evict, refcount, pin, thread safety, memory tracking | Store lifecycle |
| test_cartridge_router.py | Explicit/static/label/composite routers, config builder | Per-request dispatch routing |
| test_cartridge_routing.py | Metadata carries cart_id, two-request isolation, kernel-level no-stomp | Multi-cart dispatch correctness |
| test_cartridge_fault.py | Missing file, shape mismatch, registry errors | Error path safety |
| test_cartridge_gpu_residency.py | LRU eviction, refcount, capacity, pin/unpin, memory stability | GPU memory management |
| test_cartridge_lmcache_plugin.py | Optional import, plugin reject, key mapping | LMCache integration safety |
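The manifest checks listed for test_cartridge_manifest.py (JSON load plus model, shape, and checksum validation) might look roughly like the sketch below. The field names `model`, `shape`, and `sha256` are hypothetical, as is the function itself; the real manifest format is not shown on this page.

```python
import hashlib
import json


def validate_manifest(manifest_json, payload, expected_model):
    """Return a list of problems; an empty list means these checks passed."""
    manifest = json.loads(manifest_json)
    problems = [f"missing field: {key}" for key in ("model", "shape", "sha256")
                if key not in manifest]
    if problems:
        return problems

    if manifest["model"] != expected_model:
        problems.append("model mismatch")

    shape = manifest["shape"]
    if not (isinstance(shape, list) and shape
            and all(isinstance(n, int) and n > 0 for n in shape)):
        problems.append("invalid shape")

    if hashlib.sha256(payload).hexdigest() != manifest["sha256"]:
        problems.append("checksum mismatch")
    return problems


payload = b"stand-in for serialized KV bytes"
manifest_json = json.dumps({
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "shape": [28, 4, 256, 128],
    "sha256": hashlib.sha256(payload).hexdigest(),
})

print(validate_manifest(manifest_json, payload, "Qwen/Qwen2.5-7B-Instruct"))
```

Returning a list of problems instead of raising on the first one makes it easy to log every reason a cartridge was rejected before falling back to full prefill.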
# Clone the branch
git clone --branch 20260429-cartridges-code-only https://github.com/mcgrof/vllm.git
cd vllm

# Unit tests (no GPU needed)
uv venv --python 3.12 && source .venv/bin/activate
VLLM_USE_PRECOMPILED=1 uv pip install -e . --torch-backend=auto
pytest tests/v1/core/test_cartridge*.py -v

# Smoke serve (needs GPU + trained cartridge .pt)
vllm serve meta-llama/Llama-3.2-3B-Instruct \
--kv-transfer-config '{
"kv_connector": "CartridgeConnector",
"kv_connector_module_path":
"vllm.distributed.kv_transfer.kv_connector.v1.cartridge_connector",
"kv_connector_extra_config": {
"cartridge_path": "/path/to/cartridge.pt"
},
"kv_role": "kv_both"
}'

# Multi-cartridge dispatch
vllm serve meta-llama/Llama-3.2-3B-Instruct \
--kv-transfer-config '{
"kv_connector": "CartridgeConnector",
"kv_connector_module_path":
"vllm.distributed.kv_transfer.kv_connector.v1.cartridge_connector",
"kv_connector_extra_config": {
"cartridges": [
{"cartridge_id": "doc_a", "path": "/path/to/cart_a.pt"},
{"cartridge_id": "doc_b", "path": "/path/to/cart_b.pt"}
],
"router": {"type": "explicit"}
},
"kv_role": "kv_both"
}'
Full guide: cartridges_reproduce.md
# Clone knlp, select cartridges config, run everything
git clone https://github.com/mcgrof/knlp.git && cd knlp
make defconfig-cartridges-vllm-tests
make
# Pipeline: doctor → fetch → build → test → report
# Reuses existing vllm checkout if present (no re-clone)
# Runs 178 unit tests (no GPU needed for Tier -1)

# Run a single tier
make cartridges-test TIER=-1 # unit tests only
make cartridges-test TIER=6 # TP=2 only (needs 2 GPUs)
git clone --branch 20260429-cartridges-code-only \
https://github.com/mcgrof/vllm.git && cd vllm
VLLM_USE_PRECOMPILED=1 uv venv --python 3.12
source .venv/bin/activate
VLLM_USE_PRECOMPILED=1 uv pip install -e . --torch-backend=auto
uv pip install transformers==4.55.4 pytest pytest-timeout tblib
pytest tests/v1/core/test_cartridge*.py -v --timeout=120
# Expected: 178 passed
# Single cartridge
vllm serve meta-llama/Llama-3.2-3B-Instruct \
--kv-transfer-config '{
"kv_connector": "CartridgeConnector",
"kv_connector_module_path":
"vllm.distributed.kv_transfer.kv_connector.v1.cartridge_connector",
"kv_connector_extra_config": {
"cartridge_path": "/path/to/cartridge.pt"
},
"kv_role": "kv_both"
}'

# Multi-cartridge with per-request dispatch
vllm serve meta-llama/Llama-3.2-3B-Instruct \
--kv-transfer-config '{
"kv_connector": "CartridgeConnector",
"kv_connector_module_path":
"vllm.distributed.kv_transfer.kv_connector.v1.cartridge_connector",
"kv_connector_extra_config": {
"cartridges": [
{"cartridge_id": "doc_a", "path": "cart_a.pt"},
{"cartridge_id": "doc_b", "path": "cart_b.pt"}
],
"router": {"type": "explicit"}
},
"kv_role": "kv_both"
}'

# TP=2 (needs 2 GPUs)
vllm serve meta-llama/Llama-3.2-3B-Instruct \
--tensor-parallel-size 2 \
--kv-transfer-config '{...same as above...}'
# No trained cartridge needed — prefill any text to create one
python3 -c "
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

tok = AutoTokenizer.from_pretrained('meta-llama/Llama-3.2-3B-Instruct')
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-3.2-3B-Instruct',
    torch_dtype=torch.bfloat16).to('cuda').eval()

text = 'Your document text here. The model will use this as context.'
ids = tok.encode(text, add_special_tokens=False)

# One forward pass fills the cache with the KV state of every layer.
cache = DynamicCache()
with torch.no_grad():
    model(input_ids=torch.tensor([ids], device='cuda'),
          past_key_values=cache, use_cache=True, return_dict=True)

# Copy per-layer keys and values to CPU; the hasattr fallback covers other cache layouts.
Ks, Vs = [], []
for li in range(len(cache.layers)):
    layer = cache.layers[li]
    k = layer.keys.cpu() if hasattr(layer, 'keys') else layer.key_cache[0].cpu()
    v = layer.values.cpu() if hasattr(layer, 'values') else layer.value_cache[0].cpu()
    Ks.append(k); Vs.append(v)

# Save in the cartridge checkpoint layout: trainable KV plus empty frozen KV lists.
torch.save({'trainable_keys': Ks, 'trainable_values': Vs,
            'frozen_keys': [], 'frozen_values': []}, 'my_cartridge.pt')
print(f'Saved: {len(Ks)} layers, {Ks[0].shape[-2]} tokens')
"
20260429-cartridges-code-only on mcgrof/vllm
all_special_tokens_extended (fork-point compat; automated in the knlp pipeline)
VLLM_USE_PRECOMPILED=1 to avoid compiling C extensions from source