Interactive exploration — sharing KV heads for massive memory savings with negligible quality loss
Each Q head asks a unique question. K and V heads provide the memory that Q attends to. GQA shares KV heads across groups of Q heads.
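A minimal sketch of the grouping in code (PyTorch, with hypothetical sizes: 8 Q heads sharing 2 KV heads). Each KV head is expanded to serve its group of Q heads before standard scaled dot-product attention:

```python
import torch
import torch.nn.functional as F

batch, seq_len, d_head = 1, 16, 64
n_q_heads, n_kv_heads = 8, 2          # assumed sizes for illustration
group = n_q_heads // n_kv_heads       # Q heads per shared KV head (4)

q = torch.randn(batch, n_q_heads, seq_len, d_head)
k = torch.randn(batch, n_kv_heads, seq_len, d_head)
v = torch.randn(batch, n_kv_heads, seq_len, d_head)

# Expand each KV head so its group of Q heads can attend to it.
k = k.repeat_interleave(group, dim=1)   # (batch, n_q_heads, seq_len, d_head)
v = v.repeat_interleave(group, dim=1)

scores = q @ k.transpose(-2, -1) / d_head**0.5
out = F.softmax(scores, dim=-1) @ v     # (batch, n_q_heads, seq_len, d_head)
```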
During autoregressive generation, K and V from all previous tokens are cached. GQA dramatically reduces this cache by sharing KV heads.
Watch how KV-cache grows with sequence length — GQA keeps it manageable even at 128K+ context.
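A rough back-of-the-envelope calculation shows the effect. The sketch below assumes a hypothetical 32-layer model with d_head = 128 and 32 Q heads, comparing 32 KV heads (plain multi-head attention) against 8 shared KV heads (4:1 GQA):

```python
def kv_cache_gib(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2):
    """KV-cache size in GiB: K and V tensors per layer, FP16 by default."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem / 2**30

# Assumed config: 32 layers, d_head = 128.
for seq_len in (8_192, 32_768, 131_072):
    mha = kv_cache_gib(32, 32, 128, seq_len)   # no sharing: KV heads == Q heads
    gqa = kv_cache_gib(32, 8, 128, seq_len)    # 8 shared KV heads (4:1 ratio)
    print(f"{seq_len:>7} tokens   MHA {mha:6.1f} GiB   GQA {gqa:6.1f} GiB")
```

At 131,072 tokens this gives 64 GiB for full multi-head attention versus 16 GiB with 4:1 GQA.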
GQA reduces the K and V projection matrices — Q and output projections remain full-size. Adjust embed_dim and head counts to see the impact.
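A quick parameter count illustrates this, assuming Llama-like (but hypothetical) sizes of d_model = 4096, d_head = 128, 32 Q heads, and 8 KV heads:

```python
d_model, d_head = 4096, 128
n_q_heads, n_kv_heads = 32, 8

w_q   = d_model * (n_q_heads * d_head)   # full size: 4096 x 4096
w_k   = d_model * (n_kv_heads * d_head)  # reduced:   4096 x 1024
w_v   = d_model * (n_kv_heads * d_head)  # reduced:   4096 x 1024
w_out = (n_q_heads * d_head) * d_model   # full size: 4096 x 4096

mha_total = 4 * d_model * (n_q_heads * d_head)  # all four projections full-size
gqa_total = w_q + w_k + w_v + w_out
print(f"MHA attention params per layer: {mha_total:,}")  # 67,108,864
print(f"GQA attention params per layer: {gqa_total:,}")  # 41,943,040 (~37% fewer)
```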
Most recent production LLMs use GQA. Click any row to load its configuration into the memory calculator.
| Model | Type | Layers | Q Heads | KV Heads | d_model | d_head | Q:KV Ratio | KV-Cache Savings |
|---|---|---|---|---|---|---|---|---|
KV-cache size per model at a 128K-token sequence length (FP16). The red line marks 80 GB, the memory capacity of a single H100 or A100 GPU.
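As a worked example for this chart, the sketch below assumes Llama-3-70B-style settings (80 layers, 64 Q heads, 8 KV heads, d_head = 128) at 128K tokens in FP16; the GQA cache stays under the 80 GB line, while the same model without KV sharing would not.

```python
def kv_cache_gib(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem / 2**30

seq_len = 131_072
gqa = kv_cache_gib(80, 8, 128, seq_len)    # ~40 GiB with 8 shared KV heads
mha = kv_cache_gib(80, 64, 128, seq_len)   # ~320 GiB if every Q head had its own KV head
print(f"GQA: {gqa:.0f} GiB   MHA equivalent: {mha:.0f} GiB")
```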