🔲 Transformer Block

Step through the components of a Transformer layer

Input Embeddings
Block 1 of 12
Layer Norm 1
Multi-Head Attention
+ Residual
Layer Norm 2
Feed-Forward Network
+ Residual
Output / Next Block

📋 Component Details

Watch data flow through each component of a Transformer block. Click on any component to jump to it.
Input Embeddings + Position
Token IDs are converted to embedding vectors and summed with positional encodings. Each token becomes a d-dimensional vector (e.g., 768 for BERT-base, 4096 for LLaMA-7B).
x = token_embed(tokens) + pos_embed(positions)
# x shape: (batch, seq_len, d_model)
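A minimal runnable sketch of this step with learned positional embeddings (GPT-2 style); the vocabulary size, sequence length, and batch below are illustrative, not tied to any particular model:
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 512, 768     # illustrative sizes
token_embed = nn.Embedding(vocab_size, d_model)   # one vector per token ID
pos_embed = nn.Embedding(max_len, d_model)        # one vector per position

tokens = torch.randint(0, vocab_size, (2, 16))    # (batch, seq_len) of token IDs
positions = torch.arange(16).unsqueeze(0)         # (1, seq_len), broadcasts over batch
x = token_embed(tokens) + pos_embed(positions)
print(x.shape)                                    # torch.Size([2, 16, 768])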
Layer Normalization 1
Normalizes each token's features to have zero mean and unit variance. Crucial for stable training of deep networks.
x_norm = LayerNorm(x)
# Computes: (x - mean) / std * gamma + beta
# Per-token normalization across features
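A quick runnable check of that per-token behaviour with nn.LayerNorm; at initialization gamma=1 and beta=0, so the output is the pure normalization (sizes below are illustrative):
import torch
import torch.nn as nn

x = torch.randn(2, 16, 768)             # (batch, seq_len, d_model)
ln = nn.LayerNorm(768)                  # normalizes over the last (feature) dimension
x_norm = ln(x)
print(x_norm.mean(dim=-1).abs().max())  # ~0: each token is zero-mean across its features
print(x_norm.std(dim=-1).mean())        # ~1: and has roughly unit variance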
Multi-Head Self-Attention
Each token attends to all tokens. Multiple heads capture different relationship types (syntax, semantics, coreference). This is where tokens "communicate" with each other.
Q, K, V = x @ W_q, x @ W_k, x @ W_v
attn = softmax(Q @ K.T / sqrt(d_k)) @ V
# Multiple heads concatenated and projected
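The same computation written out for a single head as a runnable sketch; the random W_q, W_k, W_v tensors stand in for learned projection weights, and d_k=64 corresponds to 768 split across 12 heads:
import torch
import torch.nn.functional as F

d_model, d_k = 768, 64
x = torch.randn(2, 16, d_model)                 # (batch, seq_len, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))

Q, K, V = x @ W_q, x @ W_k, x @ W_v             # each (2, 16, 64)
scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (2, 16, 16) token-to-token scores
weights = F.softmax(scores, dim=-1)             # each row sums to 1
out = weights @ V                               # (2, 16, 64) one head's output
# A full multi-head layer runs many such heads in parallel, concatenates, and projects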
Residual Connection 1
Adds the original input to the attention output. This "skip connection" helps gradients flow through deep networks and allows the model to learn identity mappings.
x = x + Attention(LayerNorm(x))
# Pre-norm style: LayerNorm is applied before the sublayer, then the residual is added
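A tiny sketch of why the skip connection helps gradients, with a plain nn.Linear standing in for the attention sublayer (an illustrative stand-in only):
import torch
import torch.nn as nn

x = torch.randn(2, 16, 768, requires_grad=True)
sublayer = nn.Linear(768, 768)      # stand-in for Attention(LayerNorm(x))
y = x + sublayer(x)                 # residual: a direct path from output back to input
y.sum().backward()
# dy/dx = I + d(sublayer)/dx, so the gradient always has an unattenuated identity path
print(x.grad.shape)                 # torch.Size([2, 16, 768])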
Layer Normalization 2
Second normalization before the feed-forward network. Same operation as LN1 but with different learned parameters.
x_norm = LayerNorm(x) # Different params from LN1
Feed-Forward Network (MLP)
Two linear layers with GELU activation. Expands to 4× width then projects back. Processes each token independently — this is where "thinking" happens!
hidden = Linear(d_model, 4*d_model)(x_norm)   # Expand to 4x width
hidden = GELU(hidden)
out = Linear(4*d_model, d_model)(hidden)      # Project back to d_model
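A runnable version of the same MLP using nn.Sequential (d_model=768 here is illustrative, so the hidden width is 3072):
import torch
import torch.nn as nn

d_model = 768
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),   # expand: 768 -> 3072
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),   # project back: 3072 -> 768
)
x = torch.randn(2, 16, d_model)
out = ffn(x)                           # applied to every token position independently
print(out.shape)                       # torch.Size([2, 16, 768])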
Residual Connection 2
Adds the FFN output to its input. The block is now complete and output goes to the next Transformer block (or final layer norm).
x = x + FFN(LayerNorm(x))
Output to Next Block
The output becomes the input to the next Transformer block. Modern LLMs stack many blocks: GPT-2 has 12-48 depending on model size, LLaMA has 32-80, and GPT-3 has 96!
# Stack N blocks
for block in transformer_blocks:
    x = block(x) # Repeated N times
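A runnable sketch of that stacking loop with nn.ModuleList, assuming the TransformerBlock class given further down this page (layer count and sizes are illustrative):
import torch
import torch.nn as nn

num_layers = 12
transformer_blocks = nn.ModuleList(
    [TransformerBlock(d_model=768, num_heads=12, d_ff=3072) for _ in range(num_layers)]
)

x = torch.randn(2, 128, 768)
for block in transformer_blocks:
    x = block(x)          # same shape in, same shape out, N times
print(x.shape)            # torch.Size([2, 128, 768])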


📄 Full TransformerBlock in PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        # Multi-Head Self-Attention
        self.attn = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        # Feed-Forward Network (MLP)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),   # Expand: d → 4d
            nn.GELU(),                  # Activation
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # Project: 4d → d
            nn.Dropout(dropout)
        )
        # Layer Normalizations
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # x shape: (batch, seq_len, d_model)
        # Pre-norm + Attention + Residual
        x_norm = self.ln1(x)
        attn_out, _ = self.attn(x_norm, x_norm, x_norm, attn_mask=mask)
        x = x + attn_out   # Residual connection

        # Pre-norm + FFN + Residual
        x_norm = self.ln2(x)
        ffn_out = self.ffn(x_norm)
        x = x + ffn_out    # Residual connection
        return x
📊 Usage Example
# Create a Transformer block
block = TransformerBlock(
    d_model=768,    # Hidden dimension
    num_heads=12,   # Attention heads
    d_ff=3072,      # FFN hidden dim (4x)
    dropout=0.1
)

# Example input
batch_size = 2
seq_len = 128
x = torch.randn(batch_size, seq_len, 768)

# Forward pass
output = block(x)
print(output.shape)  # (2, 128, 768)
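As a quick follow-up, the block's parameter count can be checked directly; with d_model=768 and d_ff=3072 it comes to roughly 7.1M parameters per block:
# Count trainable parameters of the block created above
num_params = sum(p.numel() for p in block.parameters() if p.requires_grad)
print(f"{num_params:,}")   # ~7.1M: ~2.4M attention + ~4.7M FFN + ~3K LayerNorm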
📐 Tensor Shapes Throughout
Input:      (batch, seq_len, d_model) = (2, 128, 768)
After LN1:  (2, 128, 768)
After Attn: (2, 128, 768)
After Add1: (2, 128, 768)   # x + attn_out
After LN2:  (2, 128, 768)
FFN Hidden: (2, 128, 3072)  # Expanded 4x
After FFN:  (2, 128, 768)   # Projected back
After Add2: (2, 128, 768)   # x + ffn_out
Output:     (2, 128, 768)   # Same as input!
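These shapes can be verified with forward hooks on the block's submodules (a sketch reusing block and x from the usage example above; the 3072-wide FFN hidden activation is internal to the Sequential, so it is not printed here):
def print_shape(name):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output   # attn returns a tuple
        print(f"{name}: {tuple(out.shape)}")
    return hook

for name, module in [("After LN1", block.ln1), ("After Attn", block.attn),
                     ("After LN2", block.ln2), ("After FFN", block.ffn)]:
    module.register_forward_hook(print_shape(name))

_ = block(x)
# After LN1:  (2, 128, 768)
# After Attn: (2, 128, 768)
# After LN2:  (2, 128, 768)
# After FFN:  (2, 128, 768)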
🔑 Key Design Choices
• Pre-norm: LN before sublayers (more stable)
• Residual: x + sublayer(x) enables deep stacking
• 4x FFN: expansion gives capacity for learning
• GELU: smoother than ReLU, better for NLP
• Multi-head: parallel attention patterns
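The GELU point is easy to see numerically; unlike ReLU it is smooth near zero and lets small negative inputs through with a small weight:
import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(F.relu(z))   # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])
print(F.gelu(z))   # tensor([-0.0455, -0.1543,  0.0000,  0.3457,  1.9545])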