Grouped Query Attention (GQA)

Interactive exploration — sharing KV heads for massive memory savings with negligible quality loss

🧠 MHA → GQA → MQA: Head Sharing Spectrum

Each Q head asks a unique question. K and V heads provide the memory that Q attends to. GQA shares KV heads across groups of Q heads.

[Interactive diagram: KV-heads slider, shown at 8 (GQA 4:1 ratio). Legend: Q head (unique per head), K head (shared within group), V head (shared within group), group boundary.]
How to read this: Each column is one attention head slot. In MHA, every column has its own Q, K, V. In GQA, the Q heads remain unique (diverse "questions"), but K and V heads are shared within groups (compressed "memory"). Drag the KV heads slider to see the spectrum from MHA (n_kv = n_q) to MQA (n_kv = 1).
MHA: n_kv = n_q = 32 (full independence)
GQA-8: n_kv = 8, groups of 4 (the sweet spot)
MQA: n_kv = 1 (maximum compression)
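
To make the head sharing concrete, here is a minimal PyTorch sketch of the grouped-query mechanism (the gqa_attention helper, shapes, and example sizes are illustrative assumptions, not code from any particular model; causal masking is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """
    q: (batch, n_q,  seq, d_head) -- one unique query per Q head
    k: (batch, n_kv, seq, d_head) -- shared keys,   n_kv <= n_q
    v: (batch, n_kv, seq, d_head) -- shared values, n_kv <= n_q
    """
    n_q, n_kv = q.shape[1], k.shape[1]
    group_size = n_q // n_kv  # Q heads per KV head (4 for the GQA 4:1 setting above)
    # Expand each KV head so every Q head in its group attends to the same "memory".
    k = k.repeat_interleave(group_size, dim=1)  # (batch, n_q, seq, d_head)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v        # (batch, n_q, seq, d_head)

# The same code path covers the whole spectrum:
# MHA: n_kv == n_q, GQA-8: n_kv == 8, MQA: n_kv == 1.
q = torch.randn(1, 32, 16, 64)
k = torch.randn(1, 8, 16, 64)   # 8 KV heads -> groups of 4
v = torch.randn(1, 8, 16, 64)
out = gqa_attention(q, k, v)    # (1, 32, 16, 64)
```

Only k and v carry the reduced head count, so the per-token cache shrinks by a factor of n_q / n_kv while every Q head keeps its own query projection.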

💾 KV-Cache Memory Calculator

During autoregressive generation, K and V from all previous tokens are cached. GQA dramatically reduces this cache by sharing KV heads.
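
The calculator applies the standard cache-size estimate: 2 (K and V) × layers × KV heads × head dim × sequence length × bytes per element. A minimal sketch of that arithmetic (the kv_cache_bytes name and the example figures are assumptions; FP16 = 2 bytes per element):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_elem=2):
    # One K and one V tensor of shape (batch, n_kv_heads, seq_len, d_head) per layer.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

# Hypothetical 32-layer model, d_head = 128, 8K context:
mha = kv_cache_bytes(32, 32, 128, 8192)   # n_kv = n_q = 32
gqa = kv_cache_bytes(32, 8, 128, 8192)    # n_kv = 8
print(f"MHA: {mha / 2**30:.1f} GiB  vs  GQA-8: {gqa / 2**30:.1f} GiB")  # 4.0 vs 1.0
```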

📊 Memory Scaling with Context Length

Watch how the KV-cache grows linearly with sequence length; GQA keeps it manageable even at 128K+ context.

[Chart series: MHA (n_kv = n_q), GQA (n_kv = 8), MQA (n_kv = 1); shaded region marks cache sizes that exceed an 80 GB GPU.]
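
As a rough illustration of the scaling curve (an assumed 80-layer, 64-query-head, d_head = 128 configuration cached in FP16, not any specific model), the sweep below flags where each variant outgrows a single 80 GB GPU:

```python
def kv_cache_gb(n_layers, n_kv, d_head, seq_len, bytes_per_elem=2):
    return 2 * n_layers * n_kv * d_head * seq_len * bytes_per_elem / 1e9  # K and V per layer

for seq_len in (8_192, 32_768, 131_072):
    for name, n_kv in (("MHA (n_kv=64)", 64), ("GQA-8 (n_kv=8)", 8), ("MQA (n_kv=1)", 1)):
        gb = kv_cache_gb(80, n_kv, 128, seq_len)
        flag = "  <-- exceeds one 80 GB GPU" if gb > 80 else ""
        print(f"{seq_len:>7} tokens  {name:<15} {gb:7.1f} GB{flag}")
```

With these assumed numbers, MHA crosses 80 GB below 32K tokens, GQA-8 stays around 43 GB even at 128K, and MQA barely registers.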

⚙️ Projection Parameter Comparison

GQA reduces the K and V projection matrices — Q and output projections remain full-size. Adjust embed_dim and head counts to see the impact.

Key insight: The Q and Output projections are always full-size (embed_dim × embed_dim). Only K and V shrink proportionally with n_kv. Since attention projections are ~30% of total model parameters, GQA gives modest parameter savings but massive inference memory savings via KV-cache reduction.
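
A quick sketch of those projection sizes (assuming bias-free linear layers and d_head = embed_dim / n_q; the attn_projection_params helper is illustrative):

```python
def attn_projection_params(embed_dim, n_q, n_kv):
    d_head = embed_dim // n_q
    q_proj = embed_dim * (n_q * d_head)    # = embed_dim^2, always full-size
    o_proj = embed_dim * (n_q * d_head)    # = embed_dim^2, always full-size
    k_proj = embed_dim * (n_kv * d_head)   # shrinks with n_kv
    v_proj = embed_dim * (n_kv * d_head)   # shrinks with n_kv
    return q_proj + k_proj + v_proj + o_proj

mha = attn_projection_params(4096, 32, 32)   # 4 x 4096^2 ~= 67.1M per layer
gqa = attn_projection_params(4096, 32, 8)    # 2.5 x 4096^2 ~= 41.9M per layer
print(f"GQA-8 keeps {gqa / mha:.0%} of the MHA projection parameters")  # ~62%
```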

🏗️ Frontier Model Configurations

Most frontier LLMs now use GQA. Click any row to load its configuration into the memory calculator.

[Table columns: Model, Type, Layers, Q Heads, KV Heads, d_model, d_head, Ratio, KV Savings]
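
How the last two columns are derived (illustrative numbers for a Llama-3-70B-style layout with 64 Q heads and 8 KV heads; the table holds the exact values): Ratio = n_q : n_kv = 64 : 8 = 8:1, and KV Savings = n_q / n_kv = 8x, i.e. the cache is one eighth of what the same model would need with full MHA.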

📐 Visual Comparison: KV-Cache at 128K Context

KV-cache size per model at 128K sequence length (FP16). Red line = 80 GB (single H100/A100 GPU limit).
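
As a point of reference (assumed figures, not read from the chart): an 80-layer GQA-8 model with d_head = 128 caches 2 × 80 × 8 × 128 × 131,072 × 2 bytes ≈ 43 GB at 128K, comfortably under the 80 GB line, while the same model with full MHA (64 KV heads) would need roughly 344 GB.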