KV-Cache: Efficient Autoregressive Generation

Cache past keys & values: process only the new token at each step instead of the entire sequence

🎬 Watch KV-Cache Build Token by Token

A 2-layer transformer generates tokens. Watch how the KV-cache grows and how only the new token is processed at each decode step.

[Interactive demo: step-by-step animation showing the prompt tokens (prefill), the new token, the cached K/V, and the new Q attending to all cached K.]
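The structure the animation steps through can be sketched in a few lines: a per-layer cache that gains one key row and one value row per generated token. The 2-layer setup mirrors the demo; the head count and dimensions below are assumptions for illustration.

```python
import numpy as np

num_layers, num_heads, head_dim = 2, 4, 16   # 2 layers as in the demo; sizes assumed
cache = [{"k": np.empty((0, num_heads, head_dim)),
          "v": np.empty((0, num_heads, head_dim))} for _ in range(num_layers)]

def cache_new_token(cache, layer, k_new, v_new):
    """k_new, v_new: (1, num_heads, head_dim) projections of the new token only."""
    cache[layer]["k"] = np.concatenate([cache[layer]["k"], k_new], axis=0)
    cache[layer]["v"] = np.concatenate([cache[layer]["v"], v_new], axis=0)

for step in range(3):                        # three decode steps
    for layer in range(num_layers):
        cache_new_token(cache, layer,
                        np.random.randn(1, num_heads, head_dim),
                        np.random.randn(1, num_heads, head_dim))

print(cache[0]["k"].shape)  # (3, 4, 16): one cached row per generated token
```

Nothing already in the cache is recomputed; each step only appends the new token's projections.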

📊 Total Tokens Processed: Cache vs No Cache

Without KV-cache, every generated token requires reprocessing the entire sequence from scratch. With caching, only 1 new token is projected per step.

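A quick back-of-the-envelope comparison; the prompt and generation lengths here are illustrative assumptions, not the demo's settings:

```python
prompt_len, gen_len = 20, 50   # assumed example values

# Without a cache: decode step t reprocesses the prompt plus every token generated so far.
no_cache = sum(prompt_len + t for t in range(gen_len))

# With a cache: the prompt is processed once (prefill), then 1 new token per decode step.
with_cache = prompt_len + gen_len

print(f"no cache:   {no_cache} tokens processed")    # 2225
print(f"with cache: {with_cache} tokens processed")  # 70
```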

⚡ Prefill vs Decode: Two Different Bottlenecks

Prefill processes all prompt tokens in parallel (compute-bound). Decode generates one token at a time, reading the growing KV-cache (memory-bandwidth bound).

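One way to see why decode ends up memory-bandwidth bound is a rough arithmetic-intensity estimate for the attention part of a single decode step. The model sizes below are assumed (roughly 7B-class, fp16); MLP layers and other overheads are ignored.

```python
num_kv_heads, head_dim, seq_len, bytes_per_el = 32, 128, 4096, 2   # assumed fp16 model

# Bytes read from HBM: the full cached K and V for this layer.
kv_bytes = 2 * seq_len * num_kv_heads * head_dim * bytes_per_el

# FLOPs: q.K^T scores plus the weighted sum over V (~2 mul-adds per cached element).
flops = 2 * (2 * seq_len * num_kv_heads * head_dim)

print(f"arithmetic intensity ~ {flops / kv_bytes:.1f} FLOPs per byte")   # ~1
```

Roughly one FLOP per byte moved is far below what a modern GPU needs (on the order of hundreds of FLOPs per byte) to keep its compute units busy, so decode time is dominated by reading the cache rather than by arithmetic.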
Prefill Phase
• Process the entire prompt in parallel, like a training forward pass
• All K, V computed and cached in one shot
• Compute-bound: GPU cores are the bottleneck (matrix multiplications)
• High GPU utilization, fast per-token throughput
Decode Phase
• Generate one token at a time; inherently sequential
• Each step: 1 new Q attends to the entire growing cache (see the sketch after this list)
• Memory-bandwidth bound: reading the KV-cache from HBM is the bottleneck
• Low GPU compute utilization; most time is spent on memory reads
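A minimal single-head decode step with a cache, assuming standard scaled dot-product attention; the weight matrices and sizes below are made up for illustration:

```python
import numpy as np

d_model = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

def decode_step(x_new, k_cache, v_cache):
    """x_new: (d_model,) embedding of the single new token."""
    q = x_new @ W_q                               # only 1 new query is computed
    k_cache = np.vstack([k_cache, x_new @ W_k])   # cache grows by one row per step
    v_cache = np.vstack([v_cache, x_new @ W_v])
    scores = k_cache @ q / np.sqrt(d_model)       # the new Q attends to ALL cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache, k_cache, v_cache    # weighted sum over all cached values

# Pretend prefill already cached K/V for a 5-token prompt.
k_cache = rng.standard_normal((5, d_model))
v_cache = rng.standard_normal((5, d_model))
out, k_cache, v_cache = decode_step(rng.standard_normal(d_model), k_cache, v_cache)
print(out.shape, k_cache.shape)   # (64,) (6, 64)
```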

💾 KV-Cache Memory Growth During Generation

The cache grows linearly with each generated token. Configure your model to see how quickly memory fills up.

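A rough sizing formula is 2 (K and V) * layers * KV heads * head dim * bytes per element * sequence length. The configuration below is an assumed example (Llama-2-7B-like numbers in fp16), not one taken from the demo:

```python
num_layers, num_kv_heads, head_dim, bytes_per_el = 32, 32, 128, 2   # assumed config, fp16

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el   # 512 KiB
for seq_len in (2_048, 8_192, 32_768):
    gib = bytes_per_token * seq_len / 2**30
    print(f"{seq_len:>6} tokens -> {gib:5.1f} GiB of KV-cache")   # 1.0, 4.0, 16.0 GiB
```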