🔲 Transformer Block

Step through the components of a Transformer layer

Input Embeddings
Block 1 of 12
Layer Norm 1
Multi-Head Attention
+ Residual
Layer Norm 2
Feed-Forward Network
+ Residual
Output / Next Block

📋 Component Details

Watch data flow through each component of a Transformer block. Click on any component to jump to it.
Input Embeddings + Position
Token IDs are converted to embedding vectors and summed with positional encodings. Each token becomes a d-dimensional vector (e.g., 768 for BERT-base, 4096 for LLaMA-7B).
x = token_embed(tokens) + pos_embed(positions)
# x shape: (batch, seq_len, d_model)
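A minimal runnable sketch of this step with learned positional embeddings (GPT-2 style); the vocabulary size, sequence length, and batch below are illustrative, not tied to any particular model:
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 512, 768     # illustrative sizes
token_embed = nn.Embedding(vocab_size, d_model)   # one vector per token ID
pos_embed = nn.Embedding(max_len, d_model)        # one vector per position

tokens = torch.randint(0, vocab_size, (2, 16))    # (batch, seq_len) of token IDs
positions = torch.arange(16).unsqueeze(0)         # (1, seq_len), broadcasts over batch
x = token_embed(tokens) + pos_embed(positions)
print(x.shape)                                    # torch.Size([2, 16, 768])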
Layer Normalization 1
Normalizes each token's features to have zero mean and unit variance. Crucial for stable training of deep networks.
x_norm = LayerNorm(x)
# Computes: (x - mean) / std * gamma + beta
# Per-token normalization across features
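A quick runnable check of that per-token behaviour with nn.LayerNorm; at initialization gamma=1 and beta=0, so the output is the pure normalization (sizes below are illustrative):
import torch
import torch.nn as nn

x = torch.randn(2, 16, 768)             # (batch, seq_len, d_model)
ln = nn.LayerNorm(768)                  # normalizes over the last (feature) dimension
x_norm = ln(x)
print(x_norm.mean(dim=-1).abs().max())  # ~0: each token is zero-mean across its features
print(x_norm.std(dim=-1).mean())        # ~1: and has roughly unit variance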
Multi-Head Self-Attention
Each token attends to all tokens. Multiple heads capture different relationship types (syntax, semantics, coreference). This is where tokens "communicate" with each other.
Q, K, V = x @ W_q, x @ W_k, x @ W_v
attn = softmax(Q @ K.T / sqrt(d_k)) @ V
# Multiple heads concatenated and projected
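The same computation written out for a single head as a runnable sketch; the random W_q, W_k, W_v tensors stand in for learned projection weights, and d_k=64 corresponds to 768 split across 12 heads:
import torch
import torch.nn.functional as F

d_model, d_k = 768, 64
x = torch.randn(2, 16, d_model)                 # (batch, seq_len, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))

Q, K, V = x @ W_q, x @ W_k, x @ W_v             # each (2, 16, 64)
scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (2, 16, 16) token-to-token scores
weights = F.softmax(scores, dim=-1)             # each row sums to 1
out = weights @ V                               # (2, 16, 64) one head's output
# A full multi-head layer runs many such heads in parallel, concatenates, and projects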
Residual Connection 1
Adds the original input to the attention output. This "skip connection" helps gradients flow through deep networks and allows the model to learn identity mappings.
x = x + Attention(LayerNorm(x))
# Pre-norm style: LayerNorm is applied before the sublayer, then the residual is added
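A tiny sketch of why the skip connection helps gradients, with a plain nn.Linear standing in for the attention sublayer (an illustrative stand-in only):
import torch
import torch.nn as nn

x = torch.randn(2, 16, 768, requires_grad=True)
sublayer = nn.Linear(768, 768)      # stand-in for Attention(LayerNorm(x))
y = x + sublayer(x)                 # residual: a direct path from output back to input
y.sum().backward()
# dy/dx = I + d(sublayer)/dx, so the gradient always has an unattenuated identity path
print(x.grad.shape)                 # torch.Size([2, 16, 768])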
Layer Normalization 2
Second normalization before the feed-forward network. Same operation as LN1 but with different learned parameters.
x_norm = LayerNorm(x) # Different params from LN1
Feed-Forward Network (MLP)
Two linear layers with GELU activation. Expands to 4× width then projects back. Processes each token independently — this is where "thinking" happens!
hidden = Linear(d_model, 4*d_model)(x_norm)   # Expand to 4x width
hidden = GELU(hidden)
out = Linear(4*d_model, d_model)(hidden)      # Project back to d_model
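A runnable version of the same MLP using nn.Sequential (d_model=768 here is illustrative, so the hidden width is 3072):
import torch
import torch.nn as nn

d_model = 768
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),   # expand: 768 -> 3072
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),   # project back: 3072 -> 768
)
x = torch.randn(2, 16, d_model)
out = ffn(x)                           # applied to every token position independently
print(out.shape)                       # torch.Size([2, 16, 768])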
Residual Connection 2
Adds the FFN output to its input. The block is now complete and output goes to the next Transformer block (or final layer norm).
x = x + FFN(LayerNorm(x))
Output to Next Block
The output becomes the input to the next Transformer block. Modern LLMs stack many blocks: GPT-2 has 12-48 depending on model size, LLaMA has 32-80, and GPT-3 has 96!
# Stack N blocks
for block in transformer_blocks:
    x = block(x) # Repeated N times
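A runnable sketch of that stacking loop with nn.ModuleList, assuming the TransformerBlock class given further down this page (layer count and sizes are illustrative):
import torch
import torch.nn as nn

num_layers = 12
transformer_blocks = nn.ModuleList(
    [TransformerBlock(d_model=768, num_heads=12, d_ff=3072) for _ in range(num_layers)]
)

x = torch.randn(2, 128, 768)
for block in transformer_blocks:
    x = block(x)          # same shape in, same shape out, N times
print(x.shape)            # torch.Size([2, 128, 768])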


📄 Full TransformerBlock in PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        # Multi-Head Self-Attention
        self.attn = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        # Feed-Forward Network (MLP)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),   # Expand: d → 4d
            nn.GELU(),                  # Activation
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # Project: 4d → d
            nn.Dropout(dropout)
        )
        # Layer Normalizations
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # x shape: (batch, seq_len, d_model)
        # Pre-norm + Attention + Residual
        x_norm = self.ln1(x)
        attn_out, _ = self.attn(x_norm, x_norm, x_norm, attn_mask=mask)
        x = x + attn_out   # Residual connection

        # Pre-norm + FFN + Residual
        x_norm = self.ln2(x)
        ffn_out = self.ffn(x_norm)
        x = x + ffn_out    # Residual connection
        return x
📊 Usage Example
# Create a Transformer block
block = TransformerBlock(
    d_model=768,    # Hidden dimension
    num_heads=12,   # Attention heads
    d_ff=3072,      # FFN hidden dim (4x)
    dropout=0.1
)

# Example input
batch_size = 2
seq_len = 128
x = torch.randn(batch_size, seq_len, 768)

# Forward pass
output = block(x)
print(output.shape)  # (2, 128, 768)
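As a quick follow-up, the block's parameter count can be checked directly; with d_model=768 and d_ff=3072 it comes to roughly 7.1M parameters per block:
# Count trainable parameters of the block created above
num_params = sum(p.numel() for p in block.parameters() if p.requires_grad)
print(f"{num_params:,}")   # ~7.1M: ~2.4M attention + ~4.7M FFN + ~3K LayerNorm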
📐 Tensor Shapes Throughout
Input:      (batch, seq_len, d_model) = (2, 128, 768)
After LN1:  (2, 128, 768)
After Attn: (2, 128, 768)
After Add1: (2, 128, 768)   # x + attn_out
After LN2:  (2, 128, 768)
FFN Hidden: (2, 128, 3072)  # Expanded 4x
After FFN:  (2, 128, 768)   # Projected back
After Add2: (2, 128, 768)   # x + ffn_out
Output:     (2, 128, 768)   # Same as input!
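These shapes can be verified with forward hooks on the block's submodules (a sketch reusing block and x from the usage example above; the 3072-wide FFN hidden activation is internal to the Sequential, so it is not printed here):
def print_shape(name):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output   # attn returns a tuple
        print(f"{name}: {tuple(out.shape)}")
    return hook

for name, module in [("After LN1", block.ln1), ("After Attn", block.attn),
                     ("After LN2", block.ln2), ("After FFN", block.ffn)]:
    module.register_forward_hook(print_shape(name))

_ = block(x)
# After LN1:  (2, 128, 768)
# After Attn: (2, 128, 768)
# After LN2:  (2, 128, 768)
# After FFN:  (2, 128, 768)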
🔑 Key Design Choices
• Pre-norm: LN before sublayers (more stable)
• Residual: x + sublayer(x) enables deep stacking
• 4x FFN: expansion gives capacity for learning
• GELU: smoother than ReLU, better for NLP
• Multi-head: parallel attention patterns
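The GELU point is easy to see numerically; unlike ReLU it is smooth near zero and lets small negative inputs through with a small weight:
import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(F.relu(z))   # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])
print(F.gelu(z))   # tensor([-0.0455, -0.1543,  0.0000,  0.3457,  1.9545])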