Input Embeddings
↓
Block 1 of 12
Layer Norm 1
↓
Multi-Head Attention
↓
+ Residual
↓
Layer Norm 2
↓
Feed-Forward Network
↓
+ Residual
↓
Output / Next Block
📋 Component Details
Watch data flow through each component of a Transformer block.
Input Embeddings + Position
Token IDs are converted to embedding vectors and summed with
positional encodings. Each token becomes a d-dimensional vector (e.g., 768 for BERT-base,
4096 for LLaMA 7B).
x = token_embed(tokens) + pos_embed(positions)
# x shape: (batch, seq_len, d_model)
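To see this end to end, here is a minimal PyTorch sketch with learned positional embeddings (GPT-2 style); the vocabulary size, sequence length, and batch size are illustrative, not required values.
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 50257, 1024, 768    # GPT-2-small-like sizes (illustrative)
token_embed = nn.Embedding(vocab_size, d_model)
pos_embed = nn.Embedding(max_len, d_model)

tokens = torch.randint(0, vocab_size, (2, 16))     # (batch=2, seq_len=16) token IDs
positions = torch.arange(tokens.size(1))           # 0 .. seq_len-1
x = token_embed(tokens) + pos_embed(positions)     # positional embeddings broadcast over batch
print(x.shape)                                     # torch.Size([2, 16, 768])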
Layer Normalization 1
Normalizes each token's features to zero mean and unit
variance, then applies a learned scale (gamma) and shift (beta). This is crucial for stable training of deep networks.
x_norm = LayerNorm(x)
# Computes: (x - mean) / std * gamma + beta
# Per-token normalization across features
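To make the per-token computation concrete, this PyTorch sketch applies the formula by hand and checks it against nn.LayerNorm (the epsilon comes from the module's default):
import torch
import torch.nn as nn

d_model = 768
x = torch.randn(2, 16, d_model)                # (batch, seq_len, d_model)
ln = nn.LayerNorm(d_model)                     # gamma = ln.weight, beta = ln.bias

mu = x.mean(-1, keepdim=True)                  # mean over each token's features
var = ((x - mu) ** 2).mean(-1, keepdim=True)   # variance over each token's features
manual = (x - mu) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias

print(torch.allclose(ln(x), manual, atol=1e-5))  # True: matches the built-in module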
Multi-Head Self-Attention
Each token attends to all tokens. Multiple heads capture different
relationship types (syntax, semantics, coreference). This is where tokens "communicate"
with each other.
Q, K, V = x @ W_q, x @ W_k, x @ W_v
attn = softmax(Q @ K.T / sqrt(d_k)) @ V
# Multiple heads concatenated and projected
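A minimal PyTorch sketch of the multi-head computation; the projection names (W_q, W_k, W_v, W_o) and the head count follow common convention, and no causal mask is applied, matching the snippet above.
import math
import torch
import torch.nn as nn

d_model, n_heads = 768, 12
d_k = d_model // n_heads                            # 64 dims per head

W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)
W_o = nn.Linear(d_model, d_model, bias=False)       # output projection after concatenation

x = torch.randn(2, 16, d_model)                     # (batch, seq_len, d_model)
B, T, _ = x.shape

# Project, then split features into heads: (B, T, d_model) -> (B, n_heads, T, d_k)
q = W_q(x).view(B, T, n_heads, d_k).transpose(1, 2)
k = W_k(x).view(B, T, n_heads, d_k).transpose(1, 2)
v = W_v(x).view(B, T, n_heads, d_k).transpose(1, 2)

scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (B, n_heads, T, T) attention scores
attn = torch.softmax(scores, dim=-1) @ v            # (B, n_heads, T, d_k) weighted values

out = W_o(attn.transpose(1, 2).reshape(B, T, d_model))  # concatenate heads, then project
print(out.shape)                                        # torch.Size([2, 16, 768])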
Residual Connection 1
Adds the original input to the attention output. This "skip
connection" helps gradients flow through deep networks and allows the model to learn
identity mappings.
x = x + Attention(LayerNorm(x))
# Also called "Pre-norm" style
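The pre-norm pattern can be written as a tiny reusable helper; the names here (residual, norm, sublayer) are illustrative, not from any particular library.
def residual(x, norm, sublayer):
    # Pre-norm residual: normalize, transform, then add the original input back
    return x + sublayer(norm(x))

# Usage inside a block, with attention and ffn standing in for the modules described here:
# x = residual(x, ln1, attention)
# x = residual(x, ln2, ffn)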
Layer Normalization 2
Second normalization before the feed-forward network. Same
operation as LN1 but with different learned parameters.
x_norm = LayerNorm(x) # Different params from LN1
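A quick sketch showing that the two norms are separate modules with independent parameters (PyTorch; ln1 and ln2 are illustrative names):
import torch.nn as nn

d_model = 768
ln1 = nn.LayerNorm(d_model)          # used before attention
ln2 = nn.LayerNorm(d_model)          # used before the FFN: same shape, separate weights

print(ln1.weight is ln2.weight)      # False: each norm learns its own gamma and beta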
Feed-Forward Network (MLP)
Two linear layers with GELU activation. Expands to 4× width then
projects back. Processes each token independently — this is where "thinking" happens!
h = Linear(d_model, 4*d_model)(x_norm)    # Expand to 4x width
h = GELU(h)
ffn_out = Linear(4*d_model, d_model)(h)   # Project back to d_model
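To make the "each token independently" claim concrete, this PyTorch sketch runs one token through the MLP on its own and gets the same result as running the full sequence (the 4x hidden width follows the GPT-2 convention; shapes are illustrative).
import torch
import torch.nn as nn

d_model = 768
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),   # expand: 768 -> 3072
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),   # project back: 3072 -> 768
)

x_norm = torch.randn(2, 16, d_model)     # output of LayerNorm 2
out_full = ffn(x_norm)                   # whole sequence at once
out_single = ffn(x_norm[:, 3:4, :])      # token 3 alone
print(torch.allclose(out_full[:, 3:4, :], out_single, atol=1e-6))  # True: no cross-token mixing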
Residual Connection 2
Adds the FFN output to its input. The block is now complete and
output goes to the next Transformer block (or final layer norm).
x = x + FFN(LayerNorm(x))
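Putting both sublayers together, a minimal pre-norm block in PyTorch might look like this; the class and attribute names are illustrative, and nn.MultiheadAttention stands in for the hand-written attention above.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # expand
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),   # project back
        )

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual 1 around attention
        x = x + self.ffn(self.ln2(x))                        # residual 2 around the FFN
        return x

x = torch.randn(2, 16, 768)
print(TransformerBlock()(x).shape)   # torch.Size([2, 16, 768])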
Output to Next Block
The output becomes the input to the next Transformer block. Modern
LLMs stack many blocks: GPT-2 has 12-48 (small to XL), LLaMA has 32-80, and GPT-3 has 96!
# Stack N blocks
for block in transformer_blocks:
    x = block(x)  # Repeated N times
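A sketch of the stacking loop, reusing the illustrative TransformerBlock class from above; the final LayerNorm corresponds to the "final layer norm" mentioned earlier.
import torch
import torch.nn as nn

n_blocks, d_model = 12, 768
blocks = nn.ModuleList(TransformerBlock(d_model) for _ in range(n_blocks))
final_ln = nn.LayerNorm(d_model)

x = torch.randn(2, 16, d_model)      # output of the embedding step
for block in blocks:
    x = block(x)                     # each block further refines every token's representation
x = final_ln(x)                      # final layer norm before the language-model head
print(x.shape)                       # torch.Size([2, 16, 768])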