Cache past keys & values: process only the new token at each step instead of the entire sequence
A 2-layer transformer generates tokens. Watch the KV-cache grow while only the new token is processed at each decode step.
Without a KV-cache, every generated token requires reprocessing the entire sequence from scratch. With caching, only one new token is projected per step.
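A minimal NumPy sketch of that difference, assuming single-head attention with random weights (names like `W_k` and `K_cache` are illustrative, not from the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len = 8, 5
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))
tokens = rng.normal(size=(seq_len, d_model))  # stand-in token embeddings

# Without a cache: every decode step re-projects the entire prefix.
for t in range(1, seq_len + 1):
    K = tokens[:t] @ W_k  # t projections at step t
    V = tokens[:t] @ W_v

# With a cache: each step projects only the newest token and appends it.
K_cache = np.empty((0, d_model))
V_cache = np.empty((0, d_model))
for t in range(seq_len):
    k_new = tokens[t : t + 1] @ W_k  # exactly one projection per step
    v_new = tokens[t : t + 1] @ W_v
    K_cache = np.concatenate([K_cache, k_new])  # cache grows by one row
    V_cache = np.concatenate([V_cache, v_new])

# Both paths end with identical K/V; the cache only avoids recomputation.
assert np.allclose(K, K_cache) and np.allclose(V, V_cache)
```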
Prefill processes all prompt tokens in parallel (compute-bound). Decode generates one token at a time, reading the growing KV-cache (memory-bandwidth bound).
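A sketch of the two phases under the same simplified single-head setup (again with illustrative names, not the demo's internals):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_prompt = 8, 16
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))
prompt = rng.normal(size=(n_prompt, d))

# Prefill: one large batched matmul over all prompt tokens at once.
# Plenty of arithmetic per byte moved, so compute is the bottleneck.
K_cache = prompt @ W_k
V_cache = prompt @ W_v

# Decode: a single query attends over the whole cache. The arithmetic is
# tiny (a 1 x d vector against a growing t x d matrix), but every cached
# row must be streamed from memory, so bandwidth limits the step.
x_new = rng.normal(size=(1, d))
q = x_new @ W_q
scores = (q @ K_cache.T) / np.sqrt(d)  # (1, t): touches every cached key
weights = np.exp(scores - scores.max())
weights /= weights.sum()
out = weights @ V_cache  # (1, d): touches every cached value
```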
The cache grows linearly with each generated token. Configure your model to see how quickly memory fills up.
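How quickly it fills up follows directly from the config. A rough sizing helper using the standard formula (parameter names are illustrative; 2 bytes per element assumes fp16/bf16):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """2 tensors (K and V) x layers x KV heads x head dim x tokens x batch x element size."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# A 7B-class config (32 layers, 32 KV heads, head_dim 128) in fp16:
# ~0.5 MiB per generated token, ~2 GiB at a 4096-token context.
print(kv_cache_bytes(32, 32, 128, 4096) / 2**30, "GiB")
```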