🏗️ Full Transformer Architecture

Each component of the architecture is shown below with its implementation, starting with the encoder stack:

- 📝 Token Embedding
- 📍 Positional Encoding
- 🔲 Encoder Block
  - Multi-Head Attention
  - Feed-Forward Network
- × N (stack of 6-96 blocks)
- 📊 Output Projection
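A minimal sketch of the first two components, assuming a learned token embedding scaled by √d_model and the fixed sinusoidal positional encoding from the original Transformer paper; the class names and hyperparameters (vocab_size, d_model, max_len) are illustrative choices, not taken from the diagram:

```python
import math
import torch
import torch.nn as nn


class TokenEmbedding(nn.Module):
    """Maps token ids to d_model-dimensional vectors, scaled by sqrt(d_model)."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> (batch, seq_len, d_model)
        return self.embed(token_ids) * math.sqrt(self.d_model)


class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position information to the token embeddings."""

    def __init__(self, d_model: int, max_len: int = 4096):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encoding for each position
        return x + self.pe[: x.size(1)]
```

Learned positional embeddings (an nn.Embedding indexed by position, as in GPT-2) are a common alternative to the sinusoidal table.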


Inside each encoder block, multi-head attention and the feed-forward network do most of the work; their PyTorch implementations and details follow.
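One way to sketch these pieces in PyTorch, writing scaled dot-product attention by hand and using a pre-norm residual layout (the original Transformer uses post-norm and ReLU); n_heads, d_ff, and the GELU activation are assumptions chosen for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    """Scaled dot-product attention computed over n_heads parallel subspaces."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projection
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, attn_mask=None):
        batch, seq_len, d_model = x.shape
        # Project to queries, keys, values, then split into heads: (batch, heads, seq, d_head).
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)
        q = q.reshape(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        k = k.reshape(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        v = v.reshape(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        # softmax(Q K^T / sqrt(d_head)) V, with masked positions set to -inf before the softmax.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        if attn_mask is not None:
            scores = scores.masked_fill(attn_mask, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v
        out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out_proj(out)


class FeedForward(nn.Module):
    """Position-wise MLP: expand to d_ff, apply a nonlinearity, project back to d_model."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)


class EncoderBlock(nn.Module):
    """Pre-norm residual block: x + Attn(LN(x)), then x + FFN(LN(x))."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x
```

Stacking N of these blocks and adding a final linear output projection completes the encoder-style model. The decoder-only stack below reuses the same components but restricts attention to earlier positions: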

- 📝 Token Embedding
- 📍 Positional Encoding
- 🔲 Decoder Block
  - Causal Self-Attention
  - Feed-Forward Network
- × N (stack of 6-96 blocks)
- 🎯 LM Head (Next Token)
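A decoder block can reuse the MultiHeadAttention and FeedForward sketches above; the only change is a causal mask, built here with torch.triu, that blocks each position from attending to anything after it (again a pre-norm sketch, not the only possible layout):

```python
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    """Pre-norm residual block whose self-attention sees only current and past positions."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)  # defined in the encoder sketch above
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff)             # defined in the encoder sketch above

    def forward(self, x):
        seq_len = x.size(1)
        # True above the diagonal marks "future" positions; they are filled with -inf in attention.
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        x = x + self.attn(self.ln1(x), attn_mask=causal_mask)
        x = x + self.ffn(self.ln2(x))
        return x
```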


Note how the decoder uses causal masking so that each position can attend only to itself and earlier tokens, never to future ones.
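Putting the pieces together, a minimal decoder-only model might look like the sketch below, reusing TokenEmbedding, PositionalEncoding, and DecoderBlock from the earlier sketches; the default sizes (d_model=512, 8 heads, 6 layers) and the weight tying between the embedding and the LM head are illustrative assumptions, not requirements:

```python
import torch
import torch.nn as nn


class DecoderOnlyTransformer(nn.Module):
    """Token embedding -> positional encoding -> N causal decoder blocks -> LM head."""

    def __init__(self, vocab_size: int, d_model: int = 512, n_heads: int = 8,
                 d_ff: int = 2048, n_layers: int = 6, max_len: int = 4096):
        super().__init__()
        self.embed = TokenEmbedding(vocab_size, d_model)   # from the embedding sketch above
        self.pos = PositionalEncoding(d_model, max_len)
        self.blocks = nn.ModuleList(
            [DecoderBlock(d_model, n_heads, d_ff) for _ in range(n_layers)]
        )
        self.ln_final = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Optional weight tying: the LM head shares the token embedding matrix.
        self.lm_head.weight = self.embed.embed.weight

    def forward(self, token_ids):
        x = self.pos(self.embed(token_ids))        # (batch, seq_len, d_model)
        for block in self.blocks:
            x = block(x)                           # causal attention inside every block
        return self.lm_head(self.ln_final(x))      # (batch, seq_len, vocab_size) next-token logits


# Quick shape check with hypothetical sizes:
model = DecoderOnlyTransformer(vocab_size=50_000)
logits = model(torch.randint(0, 50_000, (2, 16)))  # batch of 2, sequence length 16
print(logits.shape)                                # torch.Size([2, 16, 50000])
```

Because the causal mask hides every later position, the logits at position t depend only on tokens 0…t, which is what allows the LM head to be trained to predict the next token at every position in parallel.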