KV-Cache: Efficient Autoregressive Generation

Cache past keys & values: process only the new token at each step instead of the entire sequence

🎬 Watch KV-Cache Build Token by Token

A 2-layer transformer generates tokens. Watch how the KV-cache grows and how only the new token is processed at each decode step.

[Interactive demo: step-by-step animation showing the prompt tokens (prefill), the new token, the cached K/V, and the new Q attending to all cached K.]
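The structure the animation steps through can be sketched in a few lines: a per-layer cache that gains one key row and one value row per generated token. The 2-layer setup mirrors the demo; the head count and dimensions below are assumptions for illustration.

```python
import numpy as np

num_layers, num_heads, head_dim = 2, 4, 16   # 2 layers as in the demo; sizes assumed
cache = [{"k": np.empty((0, num_heads, head_dim)),
          "v": np.empty((0, num_heads, head_dim))} for _ in range(num_layers)]

def cache_new_token(cache, layer, k_new, v_new):
    """k_new, v_new: (1, num_heads, head_dim) projections of the new token only."""
    cache[layer]["k"] = np.concatenate([cache[layer]["k"], k_new], axis=0)
    cache[layer]["v"] = np.concatenate([cache[layer]["v"], v_new], axis=0)

for step in range(3):                        # three decode steps
    for layer in range(num_layers):
        cache_new_token(cache, layer,
                        np.random.randn(1, num_heads, head_dim),
                        np.random.randn(1, num_heads, head_dim))

print(cache[0]["k"].shape)  # (3, 4, 16): one cached row per generated token
```

Nothing already in the cache is recomputed; each step only appends the new token's projections.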

📊 Total Tokens Processed: Cache vs No Cache

Without KV-cache, every generated token requires reprocessing the entire sequence from scratch. With caching, only 1 new token is projected per step.

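A quick back-of-the-envelope comparison; the prompt and generation lengths here are illustrative assumptions, not the demo's settings:

```python
prompt_len, gen_len = 20, 50   # assumed example values

# Without a cache: decode step t reprocesses the prompt plus every token generated so far.
no_cache = sum(prompt_len + t for t in range(gen_len))

# With a cache: the prompt is processed once (prefill), then 1 new token per decode step.
with_cache = prompt_len + gen_len

print(f"no cache:   {no_cache} tokens processed")    # 2225
print(f"with cache: {with_cache} tokens processed")  # 70
```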

⚡ Prefill vs Decode: Two Different Bottlenecks

Prefill processes all prompt tokens in parallel (compute-bound). Decode generates one token at a time, reading the growing KV-cache (memory-bandwidth bound).

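One way to see why decode ends up memory-bandwidth bound is a rough arithmetic-intensity estimate for the attention part of a single decode step. The model sizes below are assumed (roughly 7B-class, fp16); MLP layers and other overheads are ignored.

```python
num_kv_heads, head_dim, seq_len, bytes_per_el = 32, 128, 4096, 2   # assumed fp16 model

# Bytes read from HBM: the full cached K and V for this layer.
kv_bytes = 2 * seq_len * num_kv_heads * head_dim * bytes_per_el

# FLOPs: q.K^T scores plus the weighted sum over V (~2 mul-adds per cached element).
flops = 2 * (2 * seq_len * num_kv_heads * head_dim)

print(f"arithmetic intensity ~ {flops / kv_bytes:.1f} FLOPs per byte")   # ~1
```

Roughly one FLOP per byte moved is far below what a modern GPU needs (on the order of hundreds of FLOPs per byte) to keep its compute units busy, so decode time is dominated by reading the cache rather than by arithmetic.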
Prefill Phase
• Process the entire prompt in parallel, like a training forward pass
• All K, V computed and cached in one shot
• Compute-bound: GPU cores are the bottleneck (matrix multiplications)
• High GPU utilization, fast per-token throughput
Decode Phase
• Generate one token at a time; inherently sequential
• Each step: 1 new Q attends to the entire growing cache (see the sketch after this list)
• Memory-bandwidth bound: reading the KV-cache from HBM is the bottleneck
• Low GPU compute utilization; most time is spent on memory reads
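A minimal single-head decode step with a cache, assuming standard scaled dot-product attention; the weight matrices and sizes below are made up for illustration:

```python
import numpy as np

d_model = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

def decode_step(x_new, k_cache, v_cache):
    """x_new: (d_model,) embedding of the single new token."""
    q = x_new @ W_q                               # only 1 new query is computed
    k_cache = np.vstack([k_cache, x_new @ W_k])   # cache grows by one row per step
    v_cache = np.vstack([v_cache, x_new @ W_v])
    scores = k_cache @ q / np.sqrt(d_model)       # the new Q attends to ALL cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache, k_cache, v_cache    # weighted sum over all cached values

# Pretend prefill already cached K/V for a 5-token prompt.
k_cache = rng.standard_normal((5, d_model))
v_cache = rng.standard_normal((5, d_model))
out, k_cache, v_cache = decode_step(rng.standard_normal(d_model), k_cache, v_cache)
print(out.shape, k_cache.shape)   # (64,) (6, 64)
```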

💾 KV-Cache Memory Growth During Generation

The cache grows linearly with each generated token. Configure your model to see how quickly memory fills up.

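A rough sizing formula is 2 (K and V) * layers * KV heads * head dim * bytes per element * sequence length. The configuration below is an assumed example (Llama-2-7B-like numbers in fp16), not one taken from the demo:

```python
num_layers, num_kv_heads, head_dim, bytes_per_el = 32, 32, 128, 2   # assumed config, fp16

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el   # 512 KiB
for seq_len in (2_048, 8_192, 32_768):
    gib = bytes_per_token * seq_len / 2**30
    print(f"{seq_len:>6} tokens -> {gib:5.1f} GiB of KV-cache")   # 1.0, 4.0, 16.0 GiB
```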