Cartridges

Pre-trained KV caches that skip prefill entirely. Train once per document, inject at serving time, eliminate the prefill bottleneck. Based on arxiv:2508.17032. 55 configurations validated across 3 architectures and 6 documents on H100 and W7900. Production connector with multi-cartridge routing, GPU residency, LMCache tiering, and SQLite registry.

What is a cartridge?

A cartridge replaces runtime prefill with a pre-trained KV cache. Instead of running the document through the model on every request, the KV state is trained once offline and injected into GPU memory with a single CPU→GPU memcpy.

10.6x: Prefill speedup at 16K tokens
55: Configurations tested
7: Production modules (~2800 LOC)
4x: Compression sweet spot
185: Unit tests passing

How it works

1. Document
2. Offline training (128 steps)
3. Cartridge .pt file
4. CPU→GPU inject
5. Skip prefill, decode directly

Why GQA matters for cartridges

Checkpoint size is dominated by KV head count. At p=256:

| Model | KV Heads | Checkpoint (p=256) | At p=8192 |
|---|---|---|---|
| Qwen2.5-7B (GQA) | 4 | 15 MB | 470 MB |
| Qwen2.5-1.5B (GQA) | 2 | 7 MB | 224 MB |
| Llama-2-7b (MHA) | 32 | 128 MB | 4,856 MB |

At p=256, the Qwen2.5-7B cartridge (4 KV heads, 15 MB) is 8.5x smaller than the Llama-2-7b cartridge (32 KV heads, 128 MB). GQA is strongly preferred for cartridge deployment.
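
The p=256 sizes in the table are consistent with a plain count of KV elements: two tensors (K and V) per layer, each holding p × KV heads × head dimension values at 2 bytes each for bf16. The sketch below reproduces that arithmetic. The layer counts and head dimensions are assumptions taken from the public model configs, not from this page, and a real .pt file may carry extra metadata.

```python
def kv_checkpoint_bytes(num_layers, num_kv_heads, head_dim, prefix_len, bytes_per_elem=2):
    """Bytes needed to store K and V for prefix_len positions at every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * prefix_len * bytes_per_elem

# (layers, kv_heads, head_dim): assumed from the public Hugging Face configs.
MODELS = {
    "Qwen2.5-7B": (28, 4, 128),
    "Qwen2.5-1.5B": (28, 2, 128),
    "Llama-2-7b": (32, 32, 128),
}

def size_mib(model, prefix_len):
    """Estimated checkpoint size in MiB for one of the models above."""
    layers, kv_heads, head_dim = MODELS[model]
    return kv_checkpoint_bytes(layers, kv_heads, head_dim, prefix_len) / 2**20

for name in MODELS:
    print(f"{name}: {size_mib(name, 256):.1f} MiB at p=256")
```

Size grows linearly in both p and the KV head count, which is why the 32 KV heads of the MHA model dominate the comparison.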

Prefill speedup

Cartridge prefill time scales sub-linearly with prefix length because loading pre-computed KV (memcpy) is cheaper than running a forward pass over p tokens. Measured on Llama-3.2-1B-Instruct, AMD W7900, FlexAttention + torch.compile.

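| Prefix tokens | ICL prefill | Cartridge prefill | Speedup |
|---|---|---|---|
| 256 | 0.037s | 0.038s | 1.0x |
| 512 | 0.050s | 0.040s | 1.3x |
| 1,024 | 0.072s | 0.041s | 1.8x |
| 2,048 | 0.138s | 0.048s | 2.9x |
| 4,096 | 0.312s | 0.071s | 4.4x |
| 8,192 | 0.758s | 0.110s | 6.9x |
| 16,384 | 2.027s | 0.192s | 10.6x |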

End-to-end throughput (tok/s)

At p=16,384: Cartridge achieves 44.5 tok/s vs ICL at 27.4 tok/s (1.6x end-to-end). The speedup comes entirely from prefill savings; decode throughput is identical.
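
A 10.6x prefill speedup shrinks to 1.6x end-to-end because decode time is shared by both paths. As a rough check, assume both runs generate the same number of tokens N and spend the same decode time D, so throughput is N / (prefill + D). The two reported throughputs then determine D and N. Neither value is reported on this page; they are inferred under that assumption only.

```python
def implied_decode(prefill_a, tps_a, prefill_b, tps_b):
    """Solve N/(prefill_a + D) = tps_a and N/(prefill_b + D) = tps_b for (D, N)."""
    decode_s = (tps_b * prefill_b - tps_a * prefill_a) / (tps_a - tps_b)
    n_tokens = tps_a * (prefill_a + decode_s)
    return decode_s, n_tokens

# Figures from the p=16,384 row: cartridge first, then ICL.
decode_s, n_tokens = implied_decode(0.192, 44.5, 2.027, 27.4)
e2e_speedup = (2.027 + decode_s) / (0.192 + decode_s)

print(f"implied decode time: {decode_s:.2f} s")
print(f"implied output tokens: {n_tokens:.0f}")
print(f"end-to-end speedup: {e2e_speedup:.2f}x")
```

Under this model the run spends about 2.7 s decoding roughly 130 tokens, so removing 1.8 s of prefill cannot yield more than about 1.6x overall. The longer the generation, the smaller the end-to-end gain.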

vLLM TTFT on H100

On H100, prefill at 4K tokens takes only ~80µs of GPU time. TTFT is dominated by Python/HTTP overhead (~5ms), vLLM scheduling (~10ms), and network latency (~3ms). The cartridge advantage manifests at longer prefixes, lower-end GPUs, or higher concurrency.

Quality evaluation

LLM-as-judge: 10 factual questions per document, each answer scored 0–5. Instruct models self-judge; Llama-2-7b answers judged by Qwen-7B.

Qwen2.5-7B-Instruct

| Document | No prefix | Full | 4x comp. | 16x comp. | p=256 |
|---|---|---|---|---|---|
| ML Research (8K) | 1.2 | 4.5 | 3.3 | 3.4 | 3.1 |
| Llama 2 paper (20K) | 2.7 | 3.7 | 2.9 | 2.1 | 2.5 |
| GPL v3 (7K) | 2.9 | 3.8 | 2.8 | 2.3 | 2.4 |
| Clinical (12K) | 2.5 | 3.9 | 4.1 | 3.7 | 3.3 |
| RFC 9110 (49K) | 2.5 | 4.1 | 3.9 | 3.7 | |
| Wikipedia (2K) | 3.6 | 4.4 | 3.8 | 4.0 | 3.7 |

4x compression is the sweet spot. Quality retention is typically 75–100% of the full cartridge across all tested architectures. Instruct models compress well; base models struggle at 4x+ compression.

Key findings

Instruct tuning is critical. At p=256, instruct models score 2.4–4.0 while the base model scores 0.0–3.2 with high variance.

Model size matters less than expected. Qwen2.5-1.5B often matches Qwen2.5-7B, suggesting cartridge training adapts well to smaller models.

Prefix Tuning

Cartridges build on Prefix Tuning (Li and Liang, 2021). The core idea: prepend p trainable continuous vectors to the key and value matrices at every attention layer. These vectors are optimized via gradient descent while the model weights stay frozen.
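
The mechanism can be illustrated with a toy single-head attention in pure Python. This is a sketch of the idea only, not the training code: the prefix vectors would normally be tensors updated by an optimizer, and all names here are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights

def attend_with_prefix(query, prefix_k, prefix_v, keys, values):
    """Prepend p trainable (key, value) vectors ahead of the real tokens' KV."""
    return attend(query, prefix_k + keys, prefix_v + values)

# Toy example: p = 2 prefix vectors, 1 real token, head dimension 2.
prefix_k = [[1.0, 0.0], [0.0, 1.0]]        # trainable in prefix tuning
prefix_v = [[5.0, 0.0], [0.0, 5.0]]        # trainable in prefix tuning
keys, values = [[1.0, 1.0]], [[1.0, 1.0]]  # produced by the frozen model
query = [1.0, 0.0]

out, weights = attend_with_prefix(query, prefix_k, prefix_v, keys, values)
baseline, _ = attend(query, keys, values)
print("with prefix:", out)
print("without prefix:", baseline)
```

The query attends over p + n positions instead of n, so the prefix values steer the output without touching any model weight.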

Initialization strategy matters

Li and Liang found that initialization strongly affects final quality:

| Strategy | Description | Quality |
|---|---|---|
| Random Gaussian | Sample from N(0, 0.02) | Worst: optimizer starts far from viable KV space |
| Vocabulary embeddings | Sample from model's word embedding matrix | Better: starts in a region the model recognizes |
| Real activations (First-k) | Run the actual document through the model, take first p KV vectors | Best: warm start in the exact KV space the model produces |

First-k initialization (used by cartridges) is the natural extension: the initial KV state comes from actually processing the document, giving the optimizer a warm start in a region of KV space the model already understands. The optimizer then refines these vectors to compress the full document's information into p positions.

From fine-tuning to inference optimization

Original prefix tuning was a fine-tuning technique: train task-specific prefixes to adapt a frozen model. Cartridges repurpose it as an inference optimization: the trained prefix is the document representation, and injecting it replaces prefill entirely.

Prefix Tuning (2021): task-specific soft prompts
Cartridges (2025): document-specific KV caches

SCI vs First-k

SCI (Sampled Chunk Initialization) samples random 64-token chunks from across the document instead of taking the first p tokens. In theory, this captures broader content. In practice, both strategies produce nearly identical results:

| Prefix | First-k (s) | SCI (s) | E2E First-k (tok/s) | E2E SCI (tok/s) |
|---|---|---|---|---|
| 256 | 661 | 682 | 59.5 | 60.0 |
| 4,096 | 1,144 | 1,131 | 56.5 | 55.9 |
| 16,384 | 2,684 | 2,610 | 44.5 | 44.2 |

SCI's broader initialization does not translate into measurable quality or performance advantage over First-k.
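
The two strategies differ only in which token positions seed the initial KV state. The sketch below selects those positions. It assumes chunk-aligned, non-overlapping chunks kept in document order and a prefix length that is a multiple of the chunk size; the actual sampling scheme may differ, and the function names are hypothetical.

```python
import random

def first_k_indices(doc_len, p):
    """First-k: take the first p token positions of the document."""
    return list(range(min(p, doc_len)))

def sci_indices(doc_len, p, chunk=64, seed=0):
    """SCI sketch: sample distinct chunk-aligned chunks from across the document."""
    if p % chunk != 0:
        raise ValueError("sketch assumes p is a multiple of the chunk size")
    needed = p // chunk
    available = doc_len // chunk
    if needed > available:
        raise ValueError("document too short for the requested prefix length")
    rng = random.Random(seed)
    starts = sorted(rng.sample(range(available), needed))
    return [s * chunk + off for s in starts for off in range(chunk)]

fk = first_k_indices(16384, 256)
sci = sci_indices(16384, 256)
print("First-k covers positions", fk[0], "to", fk[-1])
print("SCI covers positions", sci[0], "to", sci[-1])
```

Either index list would then be used to gather the initial p KV vectors from a forward pass over the document.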

vLLM integration (v0.4)

CartridgeConnector is a KVConnectorBase_V1 plugin built as a production stack of seven modules (~2,800 lines). It supports single-cartridge (backward-compatible) and multi-cartridge (per-request dispatch) modes with no vLLM core modifications.

Production stack

| Module | Responsibility |
|---|---|
| CartridgeConnector | KVConnectorBase_V1 plugin. Hooks into the V1 scheduler via get_num_new_matched_tokens() and materialises KV via triton_reshape_and_cache_flash. |
| CartridgeManifest | Per-cartridge metadata + strict format validation. Names the model, layout, quantisation, and provenance. |
| CartridgeRegistry | SQLite-backed manifest index. Registry errors fall back to full prefill; a bad cartridge never poisons the serving path. |
| CartridgeStore | Read-only chunked KV access with residency bookkeeping. Payload stored separately so manifests can be scanned cheaply. |
| CartridgeGPUResidency | LRU eviction + pinning across a bounded GPU budget (default 8× largest cartridge or 1 GiB). Hot cartridges stay resident; cold ones spill. |
| CartridgeLMCachePlugin | Implements the LMCache StoragePluginInterface (read-only, min LMCache v0.3.13). Cartridges flow through the existing tiered KV offload without patches. |
| CartridgeRouter | Per-request dispatch. Explicit ids, named labels, static mapping, and composite chains. Carries cartridge_id end-to-end through connector metadata to the worker. |
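
The registry's "errors fall back to full prefill" behaviour can be pictured with a small SQLite-backed index. This is a hypothetical sketch: the class name, schema, and manifest fields are assumptions for illustration, not the CartridgeRegistry API.

```python
import json
import sqlite3

class CartridgeRegistrySketch:
    """Hypothetical SQLite-backed manifest index (schema is an assumption)."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS manifests ("
            "cartridge_id TEXT PRIMARY KEY, label TEXT, manifest TEXT NOT NULL)"
        )

    def insert(self, cartridge_id, manifest, label=None):
        """Store or overwrite the manifest for one cartridge."""
        self.conn.execute(
            "INSERT OR REPLACE INTO manifests VALUES (?, ?, ?)",
            (cartridge_id, label, json.dumps(manifest)),
        )
        self.conn.commit()

    def get(self, cartridge_id):
        """Return the manifest dict, or None so the caller falls back to full prefill."""
        try:
            row = self.conn.execute(
                "SELECT manifest FROM manifests WHERE cartridge_id = ?",
                (cartridge_id,),
            ).fetchone()
            return json.loads(row[0]) if row else None
        except (sqlite3.Error, ValueError):
            # Any database or decoding failure is treated as "no cartridge".
            return None

registry = CartridgeRegistrySketch()
registry.insert("medical", {"model": "Qwen/Qwen2.5-7B-Instruct", "prefix_len": 256})
hit = registry.get("medical")
miss = registry.get("unknown")
print(hit, miss)
```

Returning None on every failure path keeps the serving path simple: the caller has one branch for "use the cartridge" and one for "run ordinary prefill".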

Request lifecycle

1. Request arrives
2. Router resolves cartridge_id
3. Registry looks up manifest
4. Store loads KV (GPU-resident or fetch)
5. Scheduler allocates blocks
6. Connector scatters KV to paged cache
7. Decode (skip prefill)

Singleton mode (backward compatible)

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --kv-transfer-config '{
    "kv_connector": "CartridgeConnector",
    "kv_connector_module_path":
      "vllm.distributed.kv_transfer.kv_connector.v1.cartridge_connector",
    "kv_connector_extra_config": {
      "cartridge_path": "/path/to/cartridge.pt"
    },
    "kv_role": "kv_both"
  }'

Multi-cartridge mode

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --kv-transfer-config '{
    "kv_connector": "CartridgeConnector",
    "kv_connector_module_path":
      "vllm.distributed.kv_transfer.kv_connector.v1.cartridge_connector",
    "kv_connector_extra_config": {
      "cartridges": {
        "medical":  "/data/cartridges/medical.pt",
        "legal":    "/data/cartridges/legal.pt",
        "clinical": "/data/cartridges/clinical.pt"
      },
      "router": {"type": "label", "field": "cartridge_id"}
    },
    "kv_role": "kv_both"
  }'

Two concurrent requests can inject different cartridges into different allocated blocks without cross-contamination. The router carries cartridge_id end-to-end from the scheduler to the worker side of the connector.
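
Per-request dispatch can be sketched as a chain of small resolver functions. The request is modelled here as a plain dict, and the "tenant" field and function names are hypothetical; this illustrates the explicit, static, and composite routing ideas rather than the CartridgeRouter interface.

```python
def explicit_router(request):
    """Use the cartridge id the request names directly, if any."""
    return request.get("cartridge_id")

def make_static_router(mapping, field="tenant"):
    """Map a request field (hypothetical 'tenant') to a fixed cartridge id."""
    def route(request):
        return mapping.get(request.get(field))
    return route

def make_composite_router(*routers):
    """Try each router in order; the first non-None answer wins."""
    def route(request):
        for router in routers:
            cartridge_id = router(request)
            if cartridge_id is not None:
                return cartridge_id
        return None  # no cartridge: serve with ordinary prefill
    return route

route = make_composite_router(
    explicit_router,
    make_static_router({"hospital-a": "clinical"}),
)
print(route({"cartridge_id": "legal"}))
print(route({"tenant": "hospital-a"}))
print(route({"prompt": "hello"}))
```

Because the resolved id travels with the request, two requests in the same batch can resolve to different cartridges without sharing any state.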

W7900 validation

| Phase | What it tests | Result |
|---|---|---|
| Phase 0 | Single-cartridge load + KV injection smoke | PASS |
| Phase 0.5 | LMCache StoragePluginInterface integration | PASS |
| Phase 1 | Multi-cartridge routing (3 cartridges, same model) | PASS |
| Phase 2 | TTFT benchmark: cartridge vs dense prefill | 7.0x p50 speedup |
| Phase 3 | Multi-cartridge routing dispatch (20 cartridges) | PASS · 57% hit at 5/20 capacity |
| Phase 4 | GPU residency stress under pressure | PASS |
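
Phases 3 and 4 exercise residency under a budget smaller than the working set. A minimal sketch of LRU eviction with pinning over a byte budget follows. The class and method names are hypothetical and sizes are abstract units; the real CartridgeGPUResidency also tracks refcounts, which this sketch omits.

```python
from collections import OrderedDict

class GPUResidencySketch:
    """LRU over a byte budget; pinned entries are never evicted."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.entries = OrderedDict()  # cartridge_id -> size, least recent first
        self.pinned = set()
        self.hits = 0
        self.misses = 0

    def pin(self, cartridge_id):
        self.pinned.add(cartridge_id)

    def unpin(self, cartridge_id):
        self.pinned.discard(cartridge_id)

    def acquire(self, cartridge_id, size_bytes):
        """Return True on a hit; on a miss, evict cold entries and admit."""
        if cartridge_id in self.entries:
            self.entries.move_to_end(cartridge_id)
            self.hits += 1
            return True
        self.misses += 1
        for victim in list(self.entries):
            if self.used + size_bytes <= self.budget:
                break
            if victim not in self.pinned:
                self.used -= self.entries.pop(victim)
        if self.used + size_bytes > self.budget:
            raise MemoryError("budget exhausted by pinned cartridges")
        self.entries[cartridge_id] = size_bytes
        self.used += size_bytes
        return False

gpu = GPUResidencySketch(budget_bytes=3)
for cid in ("a", "b", "c"):
    gpu.acquire(cid, 1)
gpu.pin("a")
gpu.acquire("d", 1)  # evicts "b", the least recently used unpinned entry
print(list(gpu.entries), gpu.hits, gpu.misses)
```

The hit rate observed in practice depends on how skewed the request mix is, so a figure such as the Phase 3 hit rate reflects the benchmark's access pattern as much as the policy.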

ROCm build profile

# Minimal build for cartridge-only serving on W7900 / gfx1100
VLLM_BUILD_PROFILE=cartridges_rocm pip install -e .

Builds only the extensions a cartridge-serving path needs (vllm._C, vllm._rocm_C, cumem_allocator). Includes a gfx1100 sampler thread-count fix and a gfx1x skinny_gemm wave32 workaround.

Design invariant: CartridgeConnector does not allocate KV memory. vLLM still owns allocation via BlockPool. The connector only fills the slots vLLM assigned. This integrates cleanly with the scheduler, continuous batching, and prefix caching.
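
The invariant amounts to computing a slot for each cartridge token from the block ids the scheduler already handed out, then writing into those slots. The sketch below uses the usual paged-cache convention, slot = block_id × block_size + offset, with a Python list standing in for the GPU cache. Function names are illustrative, not the connector's.

```python
def slot_mapping(block_ids, block_size, num_tokens):
    """Flat cache slot for each cartridge token, given scheduler-assigned blocks."""
    if num_tokens > len(block_ids) * block_size:
        raise ValueError("not enough allocated blocks for the cartridge")
    return [
        block_ids[t // block_size] * block_size + t % block_size
        for t in range(num_tokens)
    ]

def scatter(paged_cache, cartridge_kv, slots):
    """Write each cartridge entry into its assigned slot; never allocates."""
    for slot, kv in zip(slots, cartridge_kv):
        paged_cache[slot] = kv

# Two requests share one paged cache but own disjoint blocks.
cache = [None] * 32  # 8 blocks of 4 slots
slots_a = slot_mapping([5, 2], block_size=4, num_tokens=6)
slots_b = slot_mapping([0, 7], block_size=4, num_tokens=5)
scatter(cache, [f"a{t}" for t in range(6)], slots_a)
scatter(cache, [f"b{t}" for t in range(5)], slots_b)
print(slots_a)
print(slots_b)
```

Since the block lists are disjoint, the slot sets are disjoint too, which is the property that keeps concurrent cartridges from overwriting each other.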

Cross-model analysis

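| Property | Qwen-7B | Qwen-1.5B | Llama-2-7B |
|---|---|---|---|
| Parameters | 7B | 1.5B | 7B |
| KV Heads | 4 (GQA) | 2 (GQA) | 32 (MHA) |
| p=256 checkpoint | 15 MB | 7 MB | 128 MB |
| p=8192 checkpoint | 470 MB | 224 MB | 4,856 MB |
| Best full score | 4.5 | 4.5 | 4.3 |
| Avg p=256 score | 3.1 | 3.5 | 1.6 |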

Key findings

GQA models produce dramatically smaller cartridges: at p=256, Qwen-7B's checkpoint is 8.5x smaller than Llama-2-7B's. GQA is strongly preferred.

Instruct tuning is critical for compression. Base Llama-2-7b scores 0.0–3.2 at 4x+ compression vs 2.4–4.1 for instruct models.

Small instruct models are surprisingly effective. Qwen-1.5B (1.5B params) often matches or exceeds Qwen-7B (7B params) at compressed cartridge sizes.

Concurrency scaling (H100)

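| Prefix | Conc=1 | Conc=4 | Conc=16 | Output TPS (c=1) |
|---|---|---|---|---|
| 4,096 | 31ms | 43ms | 75ms | 166 |
| 8,192 | 41ms | 69ms | 127ms | 163 |
| 16,384 | 78ms | 127ms | 245ms | 158 |
| 32,768 | 56ms | 136ms | 261ms | 158 |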

CartridgeConnector Serving Validation

Full validation of the 20260429-cartridges-code-only branch. Validates cartridge KV injection, per-request dispatch, manifest validation, GPU residency, and serving correctness.

178: Unit tests passed
10/10: Context questions correct
16/16: Concurrent requests
50/50: Stability requests
5: Bugs found & fixed

Tier results

| Tier | Test | Method | Result |
|---|---|---|---|
| -1 | Unit / static | pytest tests/v1/core/test_cartridge*.py | 178/178 PASS |
| 0 | Single-cart smoke | vllm serve + CartridgeConnector + 10 requests | PASS |
| 0.5 | Prefill canary | HF model.forward() vs cartridge KV: top-1 token match, logit diff 0.16 | PASS |
| 1 | Context verification | 10 factual medical questions via chat API | 10/10 PASS |
| 1b | Generated token logprobs | /v1/completions with logprobs=1 | PASS |
| 2 | 16 concurrent requests | 8-thread pool, same cartridge | 16/16 PASS |
| 3 | Multi-cart dispatch | 3 carts, ExplicitCartridgeRouter, per-request routing | PASS |
| 3 (cross) | No cross-contamination | Ask cart_a about cart_b data | PASS |
| 3b | Qwen cross-model | Qwen3-14B + Qwen cartridge (server starts, cart loads) | PASS |
| 4 | 50 sequential stability | Server alive after 50 requests | 50/50 PASS |
| 5 | Error handling | Empty prompt, short prompt | PASS |
| 6 | TP=2 smoke | Llama + cartridge on 2×H100; TP KV head sharding fixed (commit 99ee4e9) | PASS (sharding fixed) |
| 6 | bf16 dtype | Default in all tiers | PASS |
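
The Tier 0.5 canary compares next-token logits from a reference forward pass against logits produced with the injected cartridge KV. A minimal sketch of such a check follows, using plain lists in place of tensors. The tolerance and the toy logits are hypothetical; only the idea of "same top-1 token, small logit difference" comes from the table.

```python
def canary(reference_logits, cartridge_logits, max_abs_diff):
    """Pass if both runs pick the same top-1 token and logits stay close."""
    top1_ref = max(range(len(reference_logits)), key=reference_logits.__getitem__)
    top1_cart = max(range(len(cartridge_logits)), key=cartridge_logits.__getitem__)
    diff = max(abs(a - b) for a, b in zip(reference_logits, cartridge_logits))
    return top1_ref == top1_cart and diff <= max_abs_diff, diff

# Toy vocabulary of three tokens; threshold chosen for illustration only.
ok, diff = canary([1.0, 4.0, 2.5], [1.1, 3.9, 2.4], max_abs_diff=0.25)
flipped, _ = canary([1.0, 4.0, 2.5], [4.2, 3.9, 2.4], max_abs_diff=10.0)
print(ok, round(diff, 3), flipped)
```

Checking the top-1 token separately matters because a small logit difference can still flip the argmax when two candidates are nearly tied.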

Bugs found & fixed

| Bug | Severity | Fix | Commit |
|---|---|---|---|
| Stale triton_reshape_and_cache_flash import | Build | ops.reshape_and_cache_flash | 43714e7 |
| kv_cache_config kwarg not in base class | Init | Remove from __init__ | 1059e66 |
| _kv_transfer_config not set by base | Init | Set from vllm_config | 1059e66 |
| All prompt tokens claimed as external | Runtime | Cap at len(prompt_ids) - 1 | 43a24bb |
| LogprobsTensors OverflowError | Runtime | torch.zeros instead of torch.empty | 0211fb4 |
| TP>1 KV head sharding | Runtime | Slice cart KV heads to [tp_rank*H/tp : (tp_rank+1)*H/tp] | 99ee4e9 |
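
Two of the runtime fixes reduce to short index arithmetic. The sketch below restates them as standalone helpers. The function names are hypothetical, and the reading of the token cap (leave at least one prompt token for the model to compute) is an interpretation of the table entry, not the committed code.

```python
def shard_kv_heads(num_kv_heads, tp_rank, tp_size):
    """Head range [start, end) owned by this tensor-parallel rank."""
    if num_kv_heads % tp_size != 0:
        raise ValueError("sketch assumes KV heads divide evenly across ranks")
    per_rank = num_kv_heads // tp_size
    return tp_rank * per_rank, (tp_rank + 1) * per_rank

def cap_external_tokens(num_matched, prompt_ids):
    """Never claim every prompt token as externally provided."""
    return max(0, min(num_matched, len(prompt_ids) - 1))

# With H = 8 KV heads and TP = 2, each rank keeps half of the cartridge heads.
for rank in range(2):
    start, end = shard_kv_heads(8, rank, 2)
    print(f"rank {rank}: heads [{start}:{end}]")

prompt_ids = list(range(10))
print(cap_external_tokens(10, prompt_ids))
```

Each rank would slice the cartridge's KV tensors along the head axis with its own range before injecting them into its local cache.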

Unit test coverage

All tests run CPU-only via pytest tests/v1/core/test_cartridge*.py. No GPU required.

| File | Tests | What it covers |
|---|---|---|
| test_cartridge_connector.py | CPU fake-writer, slot mapping, KV injection shapes | Core inject path correctness |
| test_cartridge_connector_integration.py | Scheduler API: get_num_new_matched_tokens, update_state_after_alloc, build_connector_meta | State machine correctness |
| test_cartridge_manifest.py | JSON load, model/shape/checksum validation | Manifest safety checks |
| test_cartridge_registry.py | SQLite insert/get/list/label lookup | Registry operations |
| test_cartridge_store.py | Load/get/evict, refcount, pin, thread safety, memory tracking | Store lifecycle |
| test_cartridge_router.py | Explicit/static/label/composite routers, config builder | Per-request dispatch routing |
| test_cartridge_routing.py | Metadata carries cart_id, two-request isolation, kernel-level no-stomp | Multi-cart dispatch correctness |
| test_cartridge_fault.py | Missing file, shape mismatch, registry errors | Error path safety |
| test_cartridge_gpu_residency.py | LRU eviction, refcount, capacity, pin/unpin, memory stability | GPU memory management |
| test_cartridge_lmcache_plugin.py | Optional import, plugin reject, key mapping | LMCache integration safety |
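
The manifest checks listed above (JSON load, model, shape, checksum) can be pictured as one validation function that returns a list of problems. This is a sketch only: the field names and the choice of SHA-256 are assumptions, and the real CartridgeManifest format is not described on this page.

```python
import hashlib
import json

REQUIRED_FIELDS = ("model", "num_kv_heads", "prefix_len", "sha256")  # hypothetical

def validate_manifest(manifest_json, payload, model, num_kv_heads):
    """Return a list of problems; an empty list means the cartridge may be served."""
    try:
        manifest = json.loads(manifest_json)
    except ValueError:
        return ["manifest is not valid JSON"]
    if not isinstance(manifest, dict):
        return ["manifest is not a JSON object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in manifest]
    if problems:
        return problems
    if manifest["model"] != model:
        problems.append("model mismatch")
    if manifest["num_kv_heads"] != num_kv_heads:
        problems.append("shape mismatch")
    if hashlib.sha256(payload).hexdigest() != manifest["sha256"]:
        problems.append("checksum mismatch")
    return problems

payload = b"fake KV bytes"
good = json.dumps({
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "num_kv_heads": 4,
    "prefix_len": 256,
    "sha256": hashlib.sha256(payload).hexdigest(),
})
print(validate_manifest(good, payload, "Qwen/Qwen2.5-7B-Instruct", 4))
print(validate_manifest(good, payload + b"!", "Qwen/Qwen2.5-7B-Instruct", 4))
```

Collecting every problem instead of stopping at the first makes a rejected cartridge easier to diagnose from a single log line.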

How to reproduce

# Clone the branch
git clone --branch 20260429-cartridges-code-only https://github.com/mcgrof/vllm.git
cd vllm

# Unit tests (no GPU needed)
uv venv --python 3.12 && source .venv/bin/activate
VLLM_USE_PRECOMPILED=1 uv pip install -e . --torch-backend=auto
pytest tests/v1/core/test_cartridge*.py -v

# Smoke serve (needs GPU + trained cartridge .pt)
vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --kv-transfer-config '{
    "kv_connector": "CartridgeConnector",
    "kv_connector_module_path":
      "vllm.distributed.kv_transfer.kv_connector.v1.cartridge_connector",
    "kv_connector_extra_config": {
      "cartridge_path": "/path/to/cartridge.pt"
    },
    "kv_role": "kv_both"
  }'

# Multi-cartridge dispatch
vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --kv-transfer-config '{
    "kv_connector": "CartridgeConnector",
    "kv_connector_module_path":
      "vllm.distributed.kv_transfer.kv_connector.v1.cartridge_connector",
    "kv_connector_extra_config": {
      "cartridges": [
        {"cartridge_id": "doc_a", "path": "/path/to/cart_a.pt"},
        {"cartridge_id": "doc_b", "path": "/path/to/cart_b.pt"}
      ],
      "router": {"type": "explicit"}
    },
    "kv_role": "kv_both"
  }'

Reproducing the Tests

Full guide: cartridges_reproduce.md

Automated (knlp pipeline)

# Clone knlp, select cartridges config, run everything
git clone https://github.com/mcgrof/knlp.git && cd knlp
make defconfig-cartridges-vllm-tests
make

# Pipeline: doctor → fetch → build → test → report
# Reuses existing vllm checkout if present (no re-clone)
# Runs 178 unit tests (no GPU needed for Tier -1)

# Run a single tier
make cartridges-test TIER=-1   # unit tests only
make cartridges-test TIER=6    # TP=2 only (needs 2 GPUs)

Manual: unit tests (no GPU)

git clone --branch 20260429-cartridges-code-only \
  https://github.com/mcgrof/vllm.git && cd vllm
uv venv --python 3.12
source .venv/bin/activate
VLLM_USE_PRECOMPILED=1 uv pip install -e . --torch-backend=auto
uv pip install transformers==4.55.4 pytest pytest-timeout tblib

pytest tests/v1/core/test_cartridge*.py -v --timeout=120
# Expected: 178 passed

Manual: serve with cartridge (needs GPU)

# Single cartridge
vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --kv-transfer-config '{
    "kv_connector": "CartridgeConnector",
    "kv_connector_module_path":
      "vllm.distributed.kv_transfer.kv_connector.v1.cartridge_connector",
    "kv_connector_extra_config": {
      "cartridge_path": "/path/to/cartridge.pt"
    },
    "kv_role": "kv_both"
  }'

# Multi-cartridge with per-request dispatch
vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --kv-transfer-config '{
    "kv_connector": "CartridgeConnector",
    "kv_connector_module_path":
      "vllm.distributed.kv_transfer.kv_connector.v1.cartridge_connector",
    "kv_connector_extra_config": {
      "cartridges": [
        {"cartridge_id": "doc_a", "path": "cart_a.pt"},
        {"cartridge_id": "doc_b", "path": "cart_b.pt"}
      ],
      "router": {"type": "explicit"}
    },
    "kv_role": "kv_both"
  }'

# TP=2 (needs 2 GPUs)
vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --tensor-parallel-size 2 \
  --kv-transfer-config '{...same as above...}'

Building a test cartridge from any text

# No trained cartridge needed — prefill any text to create one
python3 -c "
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

tok = AutoTokenizer.from_pretrained('meta-llama/Llama-3.2-3B-Instruct')
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-3.2-3B-Instruct',
    torch_dtype=torch.bfloat16).to('cuda').eval()

text = 'Your document text here. The model will use this as context.'
ids = tok.encode(text, add_special_tokens=False)
cache = DynamicCache()
with torch.no_grad():
    model(input_ids=torch.tensor([ids], device='cuda'),
          past_key_values=cache, use_cache=True, return_dict=True)

Ks, Vs = [], []
for li in range(len(cache.layers)):
    layer = cache.layers[li]
    k = layer.keys.cpu() if hasattr(layer, 'keys') else layer.key_cache[0].cpu()
    v = layer.values.cpu() if hasattr(layer, 'values') else layer.value_cache[0].cpu()
    Ks.append(k); Vs.append(v)

torch.save({'trainable_keys': Ks, 'trainable_values': Vs,
            'frozen_keys': [], 'frozen_values': []}, 'my_cartridge.pt')
print(f'Saved: {len(Ks)} layers, {Ks[0].shape[-2]} tokens')
"

Key points for AI assistants reproducing these tests