Grouped Query Attention (GQA)

Interactive exploration — sharing KV heads for massive memory savings with negligible quality loss

🧠 MHA → GQA → MQA: Head Sharing Spectrum

Each Q head asks a unique question. K and V heads provide the memory that Q attends to. GQA shares KV heads across groups of Q heads.

[Interactive diagram: KV-heads slider, shown at 8 (GQA 4:1 ratio). Legend: Q head (unique per head), K head (shared within group), V head (shared within group), group boundary.]
How to read this: Each column is one attention head slot. In MHA, every column has its own Q, K, V. In GQA, the Q heads remain unique (diverse "questions"), but K and V heads are shared within groups (compressed "memory"). Drag the KV heads slider to see the spectrum from MHA (n_kv = n_q) to MQA (n_kv = 1).
MHA: n_kv = n_q = 32 (full independence)
GQA-8: n_kv = 8, groups of 4 (the sweet spot)
MQA: n_kv = 1 (maximum compression)
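
To make the head sharing concrete, here is a minimal PyTorch sketch of the grouped-query mechanism (the gqa_attention helper, shapes, and example sizes are illustrative assumptions, not code from any particular model; causal masking is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """
    q: (batch, n_q,  seq, d_head) -- one unique query per Q head
    k: (batch, n_kv, seq, d_head) -- shared keys,   n_kv <= n_q
    v: (batch, n_kv, seq, d_head) -- shared values, n_kv <= n_q
    """
    n_q, n_kv = q.shape[1], k.shape[1]
    group_size = n_q // n_kv  # Q heads per KV head (4 for the GQA 4:1 setting above)
    # Expand each KV head so every Q head in its group attends to the same "memory".
    k = k.repeat_interleave(group_size, dim=1)  # (batch, n_q, seq, d_head)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v        # (batch, n_q, seq, d_head)

# The same code path covers the whole spectrum:
# MHA: n_kv == n_q, GQA-8: n_kv == 8, MQA: n_kv == 1.
q = torch.randn(1, 32, 16, 64)
k = torch.randn(1, 8, 16, 64)   # 8 KV heads -> groups of 4
v = torch.randn(1, 8, 16, 64)
out = gqa_attention(q, k, v)    # (1, 32, 16, 64)
```

Only k and v carry the reduced head count, so the per-token cache shrinks by a factor of n_q / n_kv while every Q head keeps its own query projection.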

💾 KV-Cache Memory Calculator

During autoregressive generation, K and V from all previous tokens are cached. GQA dramatically reduces this cache by sharing KV heads.
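
The calculator applies the standard cache-size estimate: 2 (K and V) × layers × KV heads × head dim × sequence length × bytes per element. A minimal sketch of that arithmetic (the kv_cache_bytes name and the example figures are assumptions; FP16 = 2 bytes per element):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_elem=2):
    # One K and one V tensor of shape (batch, n_kv_heads, seq_len, d_head) per layer.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

# Hypothetical 32-layer model, d_head = 128, 8K context:
mha = kv_cache_bytes(32, 32, 128, 8192)   # n_kv = n_q = 32
gqa = kv_cache_bytes(32, 8, 128, 8192)    # n_kv = 8
print(f"MHA: {mha / 2**30:.1f} GiB  vs  GQA-8: {gqa / 2**30:.1f} GiB")  # 4.0 vs 1.0
```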

📊 Memory Scaling with Context Length

Watch how the KV-cache grows linearly with sequence length; GQA keeps it manageable even at 128K+ context.

[Chart series: MHA (n_kv = n_q), GQA (n_kv = 8), MQA (n_kv = 1); shaded region marks cache sizes that exceed an 80 GB GPU.]
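
As a rough illustration of the scaling curve (an assumed 80-layer, 64-query-head, d_head = 128 configuration cached in FP16, not any specific model), the sweep below flags where each variant outgrows a single 80 GB GPU:

```python
def kv_cache_gb(n_layers, n_kv, d_head, seq_len, bytes_per_elem=2):
    return 2 * n_layers * n_kv * d_head * seq_len * bytes_per_elem / 1e9  # K and V per layer

for seq_len in (8_192, 32_768, 131_072):
    for name, n_kv in (("MHA (n_kv=64)", 64), ("GQA-8 (n_kv=8)", 8), ("MQA (n_kv=1)", 1)):
        gb = kv_cache_gb(80, n_kv, 128, seq_len)
        flag = "  <-- exceeds one 80 GB GPU" if gb > 80 else ""
        print(f"{seq_len:>7} tokens  {name:<15} {gb:7.1f} GB{flag}")
```

With these assumed numbers, MHA crosses 80 GB below 32K tokens, GQA-8 stays around 43 GB even at 128K, and MQA barely registers.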

⚙️ Projection Parameter Comparison

GQA reduces the K and V projection matrices — Q and output projections remain full-size. Adjust embed_dim and head counts to see the impact.

Key insight: The Q and Output projections are always full-size (embed_dim × embed_dim). Only K and V shrink proportionally with n_kv. Since attention projections are ~30% of total model parameters, GQA gives modest parameter savings but massive inference memory savings via KV-cache reduction.
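
A quick sketch of those projection sizes (assuming bias-free linear layers and d_head = embed_dim / n_q; the attn_projection_params helper is illustrative):

```python
def attn_projection_params(embed_dim, n_q, n_kv):
    d_head = embed_dim // n_q
    q_proj = embed_dim * (n_q * d_head)    # = embed_dim^2, always full-size
    o_proj = embed_dim * (n_q * d_head)    # = embed_dim^2, always full-size
    k_proj = embed_dim * (n_kv * d_head)   # shrinks with n_kv
    v_proj = embed_dim * (n_kv * d_head)   # shrinks with n_kv
    return q_proj + k_proj + v_proj + o_proj

mha = attn_projection_params(4096, 32, 32)   # 4 x 4096^2 ~= 67.1M per layer
gqa = attn_projection_params(4096, 32, 8)    # 2.5 x 4096^2 ~= 41.9M per layer
print(f"GQA-8 keeps {gqa / mha:.0%} of the MHA projection parameters")  # ~62%
```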

🏗️ Frontier Model Configurations

Most frontier LLMs now use GQA. Click any row to load its configuration into the memory calculator.

[Table columns: Model, Type, Layers, Q Heads, KV Heads, d_model, d_head, Ratio, KV Savings]
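
How the last two columns are derived (illustrative numbers for a Llama-3-70B-style layout with 64 Q heads and 8 KV heads; the table holds the exact values): Ratio = n_q : n_kv = 64 : 8 = 8:1, and KV Savings = n_q / n_kv = 8x, i.e. the cache is one eighth of what the same model would need with full MHA.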

📐 Visual Comparison: KV-Cache at 128K Context

KV-cache size per model at 128K sequence length (FP16). Red line = 80 GB (single H100/A100 GPU limit).
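
As a point of reference (assumed figures, not read from the chart): an 80-layer GQA-8 model with d_head = 128 caches 2 × 80 × 8 × 128 × 131,072 × 2 bytes ≈ 43 GB at 128K, comfortably under the 80 GB line, while the same model with full MHA (64 KV heads) would need roughly 344 GB.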