Cartridges

Pre-trained KV caches that skip prefill entirely. Train once per document, inject at serving time, eliminate the prefill bottleneck. Based on arXiv:2508.17032. 55 configurations validated across 3 architectures and 6 documents on H100 and W7900.

What is a cartridge?

A cartridge replaces runtime prefill with a pre-trained KV cache. Instead of processing a document through the model on every request, the KV state is trained once offline and injected directly into GPU memory via a CPU→GPU memcpy.

- 10.6x prefill speedup at 16K tokens
- 55 configurations tested
- 0 vLLM core modifications
- 4x compression sweet spot

How it works

Document → Offline training (128 steps) → Cartridge .pt file → CPU→GPU inject → Skip prefill, decode directly
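As a concrete sketch of the artifact, a cartridge is essentially one trained K/V tensor pair per attention layer. The layout below is an assumption for illustration only (the shapes, and `pickle` standing in for `torch.save`/`torch.load`, are not the project's actual format):

```python
import pickle
import numpy as np

# Assumed geometry for illustration: p prefix positions, kv_heads KV heads,
# head_dim channels per head, one {"k", "v"} entry per layer.
p, kv_heads, head_dim, n_layers = 256, 4, 128, 28

rng = np.random.default_rng(0)
cartridge = [
    {"k": rng.standard_normal((p, kv_heads, head_dim)).astype(np.float16),
     "v": rng.standard_normal((p, kv_heads, head_dim)).astype(np.float16)}
    for _ in range(n_layers)
]

# Serving time: "injection" is just deserialization plus a host-to-device copy.
blob = pickle.dumps(cartridge)   # stands in for torch.save
restored = pickle.loads(blob)    # stands in for torch.load(..., map_location="cpu")
```

Because the artifact is plain per-layer tensors, injection is data movement rather than computation, which is why it sidesteps the forward pass entirely.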

Why GQA matters for cartridges

Checkpoint size is dominated by KV head count. At p=256:

| Model | KV heads | Checkpoint (p=256) | Checkpoint (p=8192) |
|---|---|---|---|
| Qwen2.5-7B (GQA) | 4 | 15 MB | 470 MB |
| Qwen2.5-1.5B (GQA) | 2 | 7 MB | 224 MB |
| Llama-2-7b (MHA) | 32 | 128 MB | 4,856 MB |

GQA models produce 8.5x smaller cartridges than MHA models at the same prefix length. GQA is strongly preferred for cartridge deployment.
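These sizes follow directly from the KV geometry: bytes = 2 (K and V) × layers × KV heads × head dim × p × bytes per element. A quick back-of-envelope calculator, assuming fp16 storage and layer counts/head dims taken from the published model configs (the table's figures run slightly larger, plausibly from decimal-MB reporting and serialization overhead):

```python
def cartridge_bytes(n_layers, kv_heads, head_dim, p, dtype_bytes=2):
    """Total bytes for K and V across all layers at prefix length p (fp16 default)."""
    return 2 * n_layers * kv_heads * head_dim * p * dtype_bytes

MiB = 1024 * 1024
# (n_layers, kv_heads, head_dim); assumed from the published model configs
configs = {
    "Qwen2.5-7B":   (28, 4, 128),
    "Qwen2.5-1.5B": (28, 2, 128),
    "Llama-2-7b":   (32, 32, 128),
}
sizes = {name: cartridge_bytes(n, h, d, p=256) / MiB
         for name, (n, h, d) in configs.items()}
# sizes: Qwen2.5-7B 14 MiB, Qwen2.5-1.5B 7 MiB, Llama-2-7b 128 MiB
```

The 32:4 KV-head ratio between Llama-2-7b and Qwen2.5-7B is exactly what drives the ~8.5x checkpoint-size gap.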

Prefill speedup

Cartridge prefill time scales sub-linearly with prefix length because loading pre-computed KV (memcpy) is cheaper than running a forward pass over p tokens. Measured on Llama-3.2-1B-Instruct, AMD W7900, FlexAttention + torch.compile.

| Prefix tokens | ICL prefill | Cartridge prefill | Speedup |
|---|---|---|---|
| 256 | 0.037s | 0.038s | 1.0x |
| 512 | 0.050s | 0.040s | 1.3x |
| 1,024 | 0.072s | 0.041s | 1.8x |
| 2,048 | 0.138s | 0.048s | 2.9x |
| 4,096 | 0.312s | 0.071s | 4.4x |
| 8,192 | 0.758s | 0.110s | 6.9x |
| 16,384 | 2.027s | 0.192s | 10.6x |

End-to-end throughput (tok/s)

At p=16,384: Cartridge achieves 44.5 tok/s vs ICL at 27.4 tok/s (1.6x end-to-end). The speedup comes entirely from prefill savings; decode throughput is identical.
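The gap between a 10.6x prefill speedup and a 1.6x end-to-end speedup is consistent with a simple time budget: total time is prefill plus decode, and only the prefill term changes. The output length and pure-decode rate below are assumed values chosen to be consistent with the measured numbers, not reported measurements:

```python
def e2e_tps(prefill_s, n_out, decode_tps):
    """End-to-end tokens/sec: n_out tokens after prefill_s + n_out/decode_tps seconds."""
    return n_out / (prefill_s + n_out / decode_tps)

# Prefill times from the table above (p=16,384); assumed ~128 output tokens
# and ~47.5 tok/s pure-decode rate.
icl  = e2e_tps(2.027, 128, 47.5)   # ~27 tok/s
cart = e2e_tps(0.192, 128, 47.5)   # ~44 tok/s
```

As output length grows, the fixed prefill saving is amortized over more decode time, so end-to-end gains shrink even though the prefill speedup itself is unchanged.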

vLLM TTFT on H100

On H100, prefill at 4K tokens takes only ~80µs of GPU time. TTFT is dominated by Python/HTTP overhead (~5ms), vLLM scheduling (~10ms), and network latency (~3ms). The cartridge advantage manifests at longer prefixes, lower-end GPUs, or higher concurrency.

Quality evaluation

LLM-as-judge: 10 factual questions per document, each answer scored 0–5. Instruct models self-judge; Llama-2-7b answers judged by Qwen-7B.

Qwen2.5-7B-Instruct

| Document | No prefix | Full | 4x comp. | 16x comp. | p=256 |
|---|---|---|---|---|---|
| ML Research (8K) | 1.2 | 4.5 | 3.3 | 3.4 | 3.1 |
| Llama 2 paper (20K) | 2.7 | 3.7 | 2.9 | 2.1 | 2.5 |
| GPL v3 (7K) | 2.9 | 3.8 | 2.8 | 2.3 | 2.4 |
| Clinical (12K) | 2.5 | 3.9 | 4.1 | 3.7 | 3.3 |
| RFC 9110 (49K) | 2.5 | 4.1 | 3.9 | 3.7 | – |
| Wikipedia (2K) | 3.6 | 4.4 | 3.8 | 4.0 | 3.7 |

4x compression is the sweet spot. Quality retention is typically 75–100% of the full cartridge across all tested architectures. Instruct models compress well; base models struggle at 4x+ compression.

Key findings

Instruct tuning is critical. At p=256, instruct models score 2.4–4.0 while the base model scores 0.0–3.2 with high variance.

Model size matters less than expected. Qwen-1.5B (1.5B params) often matches Qwen-7B (7B params), suggesting cartridge training adapts well to smaller models.

Prefix Tuning

Cartridges build on Prefix Tuning (Li and Liang, 2021). The core idea: prepend p trainable continuous vectors to the key and value matrices at every attention layer. These vectors are optimized via gradient descent while the model weights stay frozen.
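In sketch form, a prefix-tuned attention layer concatenates the trainable vectors onto K and V before the softmax; Q is unchanged and the frozen weights receive no gradients. A minimal single-head numpy illustration (shapes and variable names are illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, t = 64, 8, 5   # head dim, prefix length, input tokens

# Trainable prefix KV (these would be optimized by gradient descent)
prefix_k = rng.standard_normal((p, d))
prefix_v = rng.standard_normal((p, d))

# Frozen-model activations for the current tokens
q = rng.standard_normal((t, d))
k = rng.standard_normal((t, d))
v = rng.standard_normal((t, d))

# Attend over [prefix; tokens] along the key/value axis
K = np.concatenate([prefix_k, k])            # (p + t, d)
V = np.concatenate([prefix_v, v])
scores = q @ K.T / np.sqrt(d)                # (t, p + t)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                            # (t, d): same shape as without the prefix
```

The output shape is unchanged, which is why a trained prefix can be dropped into an existing KV cache without touching the model.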

Initialization strategy matters

Li and Liang found that initialization strongly affects final quality:

| Strategy | Description | Quality |
|---|---|---|
| Random Gaussian | Sample from N(0, 0.02) | Worst: optimizer starts far from viable KV space |
| Vocabulary embeddings | Sample from the model's word embedding matrix | Better: starts in a region the model recognizes |
| Real activations (First-k) | Run the actual document through the model, take the first p KV vectors | Best: warm start in the exact KV space the model produces |

First-k initialization (used by cartridges) is the natural extension: the initial KV state comes from actually processing the document, giving the optimizer a warm start in a region of KV space the model already understands. The optimizer then refines these vectors to compress the full document's information into p positions.

From fine-tuning to inference optimization

Original prefix tuning was a fine-tuning technique: train task-specific prefixes to adapt a frozen model. Cartridges repurpose it as an inference optimization: the trained prefix is the document representation, and injecting it replaces prefill entirely.

Prefix Tuning (2021): task-specific soft prompts → Cartridges (2025): document-specific KV caches

SCI vs First-k

SCI (Sampled Chunk Initialization) samples random 64-token chunks from across the document instead of taking the first p tokens. In theory, this captures broader content. In practice, both strategies produce nearly identical results:

| Prefix | First-k (s) | SCI (s) | E2E First-k (tok/s) | E2E SCI (tok/s) |
|---|---|---|---|---|
| 256 | 661 | 682 | 59.5 | 60.0 |
| 4,096 | 1,144 | 1,131 | 56.5 | 55.9 |
| 16,384 | 2,684 | 2,610 | 44.5 | 44.2 |

SCI's broader initialization does not translate into measurable quality or performance advantage over First-k.
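The difference between the two strategies is only which p document positions seed the prefix. A stdlib sketch of both selectors (the 64-token chunk size comes from the description above; function names are illustrative):

```python
import random

def first_k_indices(doc_len, p):
    """First-k: the first p token positions of the document."""
    return list(range(min(p, doc_len)))

def sci_indices(doc_len, p, chunk=64, seed=0):
    """SCI: concatenate random 64-token chunks sampled from across the document."""
    rng = random.Random(seed)
    idx = []
    while len(idx) < p:
        start = rng.randrange(0, max(1, doc_len - chunk))
        idx.extend(range(start, start + chunk))
    return idx[:p]

fk = first_k_indices(20_000, 256)   # positions 0..255
sci = sci_indices(20_000, 256)      # four 64-token chunks from anywhere in the doc
```

Either selection yields p KV vectors to initialize the optimizer; the training loop then does the real work of compressing the whole document, which is presumably why the initializations converge to similar quality.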

vLLM integration

CartridgeConnector plugs into vLLM v0.16.0 as a KVConnectorBase_V1 plugin with zero vLLM core modifications. About 300 lines of Python.

Request lifecycle

1. Server starts
2. Load .pt + prefix IDs
3. Request arrives
4. Match prefix
5. Allocate blocks
6. Scatter KV to GPU
7. Decode

Minimum requirements

| Requirement | Value |
|---|---|
| vLLM version | ≥ 0.16.0 (KVConnectorBase_V1 API) |
| Artifacts | cartridge.pt + prefix_token_ids.json |
| Core patches | None |
| Connector size | ~300 lines of Python |

Server startup

```shell
export CARTRIDGE=/path/to/cartridge.pt
export PREFIX_IDS=/path/to/prefix_token_ids.json

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.40 \
  --kv-transfer-config '{
    "kv_connector": "CartridgeConnector",
    "kv_connector_module_path": "vllm.distributed.kv_transfer.kv_connector.v1.cartridge_connector",
    "kv_connector_extra_config": {
      "cartridge_path": "'"$CARTRIDGE"'",
      "prefix_token_ids_path": "'"$PREFIX_IDS"'"
    },
    "kv_role": "kv_both"
  }'
```

Connector implementation

```python
import torch

class CartridgeConnector(KVConnectorBase_V1):
    """Simplified sketch of the ~300-line connector; helpers such as
    longest_common_prefix, align_down_to_block_size, scatter_to_gpu_blocks,
    and load_json are elided here."""

    def __init__(self, rank, local_rank, config):
        # Load the trained KV cache once at startup and keep it on CPU
        self.cartridge = torch.load(config.cartridge_path, map_location="cpu")
        self.prefix_token_ids = load_json(config.prefix_token_ids_path)
        self.layer_k = [layer["k"].contiguous() for layer in self.cartridge]
        self.layer_v = [layer["v"].contiguous() for layer in self.cartridge]

    def get_num_new_matched_tokens(self, request, **kwargs):
        # Compare request tokens against the stored prefix, block-aligned
        matched = longest_common_prefix(
            request.prompt_token_ids, self.prefix_token_ids
        )
        return align_down_to_block_size(matched, block_size=16)

    def start_load_kv(self, connector_meta, **kwargs):
        # Scatter pre-trained KV from CPU into the GPU slots vLLM assigned
        slot_mapping = connector_meta["slot_mapping"]
        matched = connector_meta["matched_tokens"]
        for layer_idx in range(len(self.layer_k)):
            scatter_to_gpu_blocks(
                self.layer_k[layer_idx][:matched],
                self.layer_v[layer_idx][:matched],
                slot_mapping[:matched],
            )
```
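The two matching helpers referenced above are straightforward; a plausible stdlib implementation (these are illustrative stand-ins, not the connector's actual code):

```python
def longest_common_prefix(a, b):
    """Number of leading token IDs shared by two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def align_down_to_block_size(n, block_size=16):
    """Round down to a multiple of the KV block size, so only whole blocks are injected."""
    return (n // block_size) * block_size

stored = list(range(100))            # prefix_token_ids
prompt = list(range(37)) + [999]     # request diverges after 37 tokens
matched = longest_common_prefix(prompt, stored)   # 37
usable = align_down_to_block_size(matched)        # 32: two full 16-token blocks
```

Rounding down matters because vLLM manages KV memory in fixed-size blocks; a partially filled block would mix injected and to-be-computed positions.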

Sending requests

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system",
       "content": "<full training document text>"},
      {"role": "user",
       "content": "What is the main contribution?"}
    ]
  }'
```

Key design point: CartridgeConnector does not allocate KV memory. vLLM still owns allocation. The connector only fills the slots vLLM assigned. This integrates cleanly with the scheduler and continuous batching.

Cross-model analysis

| Property | Qwen-7B | Qwen-1.5B | Llama-2-7B |
|---|---|---|---|
| Parameters | 7B | 1.5B | 7B |
| KV heads | 4 (GQA) | 2 (GQA) | 32 (MHA) |
| p=256 checkpoint | 15 MB | 7 MB | 128 MB |
| p=8192 checkpoint | 470 MB | 224 MB | 4,856 MB |
| Best full score | 4.5 | 4.5 | 4.3 |
| Avg p=256 score | 3.1 | 3.5 | 1.6 |

Key findings

GQA models produce dramatically smaller cartridges. 8.5x smaller than MHA at the same prefix. GQA is strongly preferred.

Instruct tuning is critical for compression. Base Llama-2-7b scores 0.0–3.2 at 4x+ compression vs 2.4–4.1 for instruct models.

Small instruct models are surprisingly effective. Qwen-1.5B (1.5B params) often matches or exceeds Qwen-7B (7B params) at compressed cartridge sizes.

Concurrency scaling (H100)

| Prefix | Conc=1 | Conc=4 | Conc=16 | Output TPS (c=1) |
|---|---|---|---|---|
| 4,096 | 31ms | 43ms | 75ms | 166 |
| 8,192 | 41ms | 69ms | 127ms | 163 |
| 16,384 | 78ms | 127ms | 245ms | 158 |
| 32,768 | 56ms | 136ms | 261ms | 158 |