Pre-trained KV caches that skip prefill entirely. Train once per document, inject at serving time, eliminate the prefill bottleneck. Based on arxiv:2508.17032. 55 configurations validated across 3 architectures and 6 documents on H100 and W7900.
A cartridge replaces runtime prefill with a pre-trained KV cache. Instead of processing the document through the model at every request, the KV state is trained once offline and injected into GPU memory at serving time via a CPU→GPU memcpy.
Checkpoint size is dominated by KV head count. At p=256:
| Model | KV Heads | Checkpoint | At p=8192 |
|---|---|---|---|
| Qwen2.5-7B (GQA) | 4 | 15 MB | 470 MB |
| Qwen2.5-1.5B (GQA) | 2 | 7 MB | 224 MB |
| Llama-2-7b (MHA) | 32 | 128 MB | 4,856 MB |
GQA models produce 8.5x smaller cartridges than MHA models at the same prefix length. GQA is strongly preferred for cartridge deployment.
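Checkpoint size follows directly from KV geometry: 2 tensors (K and V) × layers × KV heads × head dim × p positions × 2 bytes for bf16. A back-of-envelope sketch (the layer counts and head dims below are the standard published configs, not stated in the table above; real checkpoints add serialization overhead and MB/MiB rounding, so estimates land within a few percent of the measured sizes):

```python
def cartridge_bytes(num_layers, num_kv_heads, head_dim, p, dtype_bytes=2):
    # One K and one V tensor per layer, each of shape [p, num_kv_heads, head_dim].
    return 2 * num_layers * num_kv_heads * head_dim * p * dtype_bytes

# Assumed standard configs: Qwen2.5-7B has 28 layers / 4 KV heads / head_dim 128;
# Qwen2.5-1.5B has 28 / 2 / 128; Llama-2-7b has 32 / 32 / 128.
for name, (layers, kv_heads, head_dim) in {
    "Qwen2.5-7B": (28, 4, 128),
    "Qwen2.5-1.5B": (28, 2, 128),
    "Llama-2-7b": (32, 32, 128),
}.items():
    mb = cartridge_bytes(layers, kv_heads, head_dim, p=256) / 1e6
    print(f"{name}: {mb:.1f} MB at p=256")
```

The MHA penalty is exactly the KV head ratio: 32 heads vs 4 heads is where the 8.5x gap comes from.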
Cartridge prefill time scales sub-linearly with prefix length because loading pre-computed KV (memcpy) is cheaper than running a forward pass over p tokens. Measured on Llama-3.2-1B-Instruct, AMD W7900, FlexAttention + torch.compile.
| Prefix tokens | ICL prefill | Cartridge prefill | Speedup |
|---|---|---|---|
| 256 | 0.037s | 0.038s | 1.0x |
| 512 | 0.050s | 0.040s | 1.3x |
| 1,024 | 0.072s | 0.041s | 1.8x |
| 2,048 | 0.138s | 0.048s | 2.9x |
| 4,096 | 0.312s | 0.071s | 4.4x |
| 8,192 | 0.758s | 0.110s | 6.9x |
| 16,384 | 2.027s | 0.192s | 10.6x |
At p=16,384: Cartridge achieves 44.5 tok/s vs ICL at 27.4 tok/s (1.6x end-to-end). The speedup comes entirely from prefill savings; decode throughput is identical.
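The 1.6x end-to-end figure is what the prefill savings alone predict. A sketch of the arithmetic, assuming a hypothetical ~128 output tokens at ~48 tok/s decode (both assumed, not measured; decode speed is identical for the two paths, so only the prefill term differs):

```python
def e2e_tps(n_out, prefill_s, decode_tps):
    # End-to-end throughput: output tokens over total request time.
    return n_out / (prefill_s + n_out / decode_tps)

# p=16,384 prefill times from the table above.
icl = e2e_tps(128, prefill_s=2.027, decode_tps=48.0)   # ~27 tok/s
cart = e2e_tps(128, prefill_s=0.192, decode_tps=48.0)  # ~45 tok/s
```

With these assumptions the model reproduces both measured throughputs and the ~1.6x ratio, consistent with the claim that the gain is pure prefill savings.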
On H100, prefill at 4K tokens takes only ~80µs of GPU time. TTFT is dominated by Python/HTTP overhead (~5ms), vLLM scheduling (~10ms), and network latency (~3ms). The cartridge advantage manifests with longer prefixes, on lower-end GPUs, or under higher concurrency.
LLM-as-judge: 10 factual questions per document, each answer scored 0–5. Instruct models self-judge; Llama-2-7b answers judged by Qwen-7B.
| Document | No prefix | Full | 4x comp. | 16x comp. | p=256 |
|---|---|---|---|---|---|
| ML Research (8K) | 1.2 | 4.5 | 3.3 | 3.4 | 3.1 |
| Llama 2 paper (20K) | 2.7 | 3.7 | 2.9 | 2.1 | 2.5 |
| GPL v3 (7K) | 2.9 | 3.8 | 2.8 | 2.3 | 2.4 |
| Clinical (12K) | 2.5 | 3.9 | 4.1 | 3.7 | 3.3 |
| RFC 9110 (49K) | 2.5 | — | 4.1 | 3.9 | 3.7 |
| Wikipedia (2K) | 3.6 | 4.4 | 3.8 | 4.0 | 3.7 |
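For concreteness, the judging loop reduces to a prompt template plus a score parser. A hedged sketch (the exact prompt wording is an assumption; only the 0–5 scale and 10-questions-per-document setup come from the description above):

```python
import re

def build_judge_prompt(question, answer):
    # Hypothetical wording; the eval's actual prompt is not shown in this doc.
    return (
        "Score the following answer from 0 (wrong) to 5 (fully correct).\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a single integer."
    )

def parse_judge_score(reply):
    # Take the first standalone digit 0-5 in the judge model's reply.
    m = re.search(r"\b([0-5])\b", reply)
    if m is None:
        raise ValueError(f"no score found in {reply!r}")
    return int(m.group(1))
```

Per-document scores in the table are then the mean over the 10 parsed answers.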
Instruct tuning is critical. At p=256, instruct models score 2.4–4.0 while the base model scores 0.0–3.2 with high variance.
Model size matters less than expected. Qwen-1.5B (1.5B params) often matches Qwen-7B (7B params), suggesting cartridge training adapts well to smaller models.
Cartridges build on Prefix Tuning (Li and Liang, 2021). The core idea: prepend p trainable continuous vectors to the key and value matrices at every attention layer. These vectors are optimized via gradient descent while the model weights stay frozen.
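Mechanically, prefix tuning is just concatenation at attention time. A toy single-head, single-query sketch in plain Python (all numbers are illustrative, not from the paper):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(q, keys, values):
    # Scaled dot-product attention for a single query vector.
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# KV the frozen model produced for two document tokens.
doc_k = [[0.1, 0.2], [0.3, -0.1]]
doc_v = [[1.0, 0.0], [0.0, 1.0]]

# p=2 trainable prefix KV vectors: the only parameters gradient
# descent touches; all model weights stay frozen.
prefix_k = [[0.5, 0.5], [-0.2, 0.4]]
prefix_v = [[0.7, 0.3], [0.2, 0.8]]

q = [0.4, -0.3]
out_plain = attend(q, doc_k, doc_v)
out_prefixed = attend(q, prefix_k + doc_k, prefix_v + doc_v)
```

Prepending `prefix_k`/`prefix_v` changes every query's attention distribution without touching a single weight; training adjusts only those p vectors per layer.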
Li and Liang found that initialization strongly affects final quality:
| Strategy | Description | Quality |
|---|---|---|
| Random Gaussian | Sample from N(0, 0.02) | Worst — optimizer starts far from viable KV space |
| Vocabulary embeddings | Sample from model's word embedding matrix | Better — starts in a region the model recognizes |
| Real activations (First-k) | Run the actual document through the model, take first p KV vectors | Best — warm start in the exact KV space the model produces |
First-k initialization (used by cartridges) is the natural extension: the initial KV state comes from actually processing the document, giving the optimizer a warm start in a region of KV space the model already understands. The optimizer then refines these vectors to compress the full document's information into p positions.
Original prefix tuning was a fine-tuning technique: train task-specific prefixes to adapt a frozen model. Cartridges repurpose it as an inference optimization: the trained prefix is the document representation, and injecting it replaces prefill entirely.
SCI (Sampled Chunk Initialization) samples random 64-token chunks from across the document instead of taking the first p tokens. In theory, this captures broader content. In practice, both strategies produce nearly identical results:
| Prefix | First-k train (s) | SCI train (s) | E2E First-k (tok/s) | E2E SCI (tok/s) |
|---|---|---|---|---|
| 256 | 661 | 682 | 59.5 | 60.0 |
| 4,096 | 1,144 | 1,131 | 56.5 | 55.9 |
| 16,384 | 2,684 | 2,610 | 44.5 | 44.2 |
SCI's broader initialization does not translate into measurable quality or performance advantage over First-k.
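The chunk-sampling step of SCI can be sketched in a few lines. A plain-Python sketch of index selection only (the real implementation operates on per-layer KV tensors; `chunk=64` matches the 64-token chunks described above, and the seed/dedup details are assumptions):

```python
import random

def sci_init_indices(doc_len, p, chunk=64, seed=0):
    # Sampled Chunk Initialization: gather random 64-token chunks from
    # across the document until p distinct positions are collected.
    assert p <= doc_len
    rng = random.Random(seed)
    picked, seen = [], set()
    while len(picked) < p:
        start = rng.randrange(0, max(1, doc_len - chunk))
        for i in range(start, start + chunk):
            if i not in seen:
                seen.add(i)
                picked.append(i)
    return sorted(picked[:p])

def first_k_indices(doc_len, p):
    # First-k: simply the first p token positions.
    return list(range(min(p, doc_len)))
```

Both functions return p KV positions to initialize from; per the results above, the extra spread SCI buys does not change quality or throughput.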
CartridgeConnector plugs into vLLM v0.16.0 as a KVConnectorBase_V1 plugin with zero vLLM core modifications. About 300 lines of Python.
| Requirement | Value |
|---|---|
| vLLM version | ≥ 0.16.0 (KVConnectorBase_V1 API) |
| Artifacts | cartridge.pt + prefix_token_ids.json |
| Core patches | None |
| Connector size | ~300 lines Python |
```bash
export CARTRIDGE=/path/to/cartridge.pt
export PREFIX_IDS=/path/to/prefix_token_ids.json

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.40 \
  --kv-transfer-config '{
    "kv_connector": "CartridgeConnector",
    "kv_connector_module_path": "vllm.distributed.kv_transfer.kv_connector.v1.cartridge_connector",
    "kv_connector_extra_config": {
      "cartridge_path": "'"$CARTRIDGE"'",
      "prefix_token_ids_path": "'"$PREFIX_IDS"'"
    },
    "kv_role": "kv_both"
  }'
```
```python
class CartridgeConnector(KVConnectorBase_V1):
    def __init__(self, rank, local_rank, config):
        self.cartridge = torch.load(config.cartridge_path, map_location="cpu")
        self.prefix_token_ids = load_json(config.prefix_token_ids_path)
        self.layer_k = [layer["k"].contiguous() for layer in self.cartridge]
        self.layer_v = [layer["v"].contiguous() for layer in self.cartridge]

    def get_num_new_matched_tokens(self, request, **kwargs):
        # Compare request tokens against stored prefix, block-aligned
        matched = longest_common_prefix(
            request.prompt_token_ids, self.prefix_token_ids
        )
        return align_down_to_block_size(matched, block_size=16)

    def start_load_kv(self, connector_meta, **kwargs):
        # Scatter pre-trained KV from CPU to vLLM-assigned GPU slots
        slot_mapping = connector_meta["slot_mapping"]
        matched = connector_meta["matched_tokens"]
        for layer_idx in range(len(self.layer_k)):
            scatter_to_gpu_blocks(
                self.layer_k[layer_idx][:matched],
                self.layer_v[layer_idx][:matched],
                slot_mapping[:matched],
            )
```
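The two helpers the connector leans on are not shown above; plain-Python versions might look like this (the `block_size=16` default matches vLLM's standard KV block size):

```python
def longest_common_prefix(a, b):
    # Number of leading token ids the request shares with the stored prefix.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def align_down_to_block_size(n, block_size=16):
    # vLLM allocates KV cache in fixed-size blocks; only whole blocks
    # can be reported as externally matched, so round down.
    return (n // block_size) * block_size
```

Rounding down means a request matching, say, 37 prefix tokens injects 32 and recomputes the remainder in the normal prefill path.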
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "<full training document text>"},
      {"role": "user", "content": "What is the main contribution?"}
    ]
  }'
```
| Property | Qwen-7B | Qwen-1.5B | Llama-2-7B |
|---|---|---|---|
| Parameters | 7B | 1.5B | 7B |
| KV Heads | 4 (GQA) | 2 (GQA) | 32 (MHA) |
| p=256 checkpoint | 15 MB | 7 MB | 128 MB |
| p=8192 checkpoint | 470 MB | 224 MB | 4,856 MB |
| Best full score | 4.5 | 4.5 | 4.3 |
| Avg p=256 score | 3.1 | 3.5 | 1.6 |
GQA models produce dramatically smaller cartridges: 8.5x smaller than MHA at the same prefix length. GQA is strongly preferred for cartridge deployment.
Instruct tuning is critical for compression. Base Llama-2-7b scores 0.0–3.2 at 4x+ compression vs 2.4–4.1 for instruct models.
Small instruct models are surprisingly effective. Qwen-1.5B (1.5B params) often matches or exceeds Qwen-7B (7B params) at compressed cartridge sizes.
| Prefix tokens | TTFT, conc=1 | TTFT, conc=4 | TTFT, conc=16 | Output TPS (conc=1) |
|---|---|---|---|---|
| 4,096 | 31ms | 43ms | 75ms | 166 |
| 8,192 | 41ms | 69ms | 127ms | 163 |
| 16,384 | 78ms | 127ms | 245ms | 158 |
| 32,768 | 56ms | 136ms | 261ms | 158 |