🏗️ Full Transformer Architecture

Each component of the architecture is shown below with its implementation, starting with the encoder stack:

- 📝 Token Embedding
- 📍 Positional Encoding
- 🔲 Encoder Block
  - Multi-Head Attention
  - Feed-Forward Network
- × N (stack of 6-96 blocks)
- 📊 Output Projection
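A minimal sketch of the first two components, assuming a learned token embedding scaled by √d_model and the fixed sinusoidal positional encoding from the original Transformer paper; the class names and hyperparameters (vocab_size, d_model, max_len) are illustrative choices, not taken from the diagram:

```python
import math
import torch
import torch.nn as nn


class TokenEmbedding(nn.Module):
    """Maps token ids to d_model-dimensional vectors, scaled by sqrt(d_model)."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> (batch, seq_len, d_model)
        return self.embed(token_ids) * math.sqrt(self.d_model)


class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position information to the token embeddings."""

    def __init__(self, d_model: int, max_len: int = 4096):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encoding for each position
        return x + self.pe[: x.size(1)]
```

Learned positional embeddings (an nn.Embedding indexed by position, as in GPT-2) are a common alternative to the sinusoidal table.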


Inside each encoder block, multi-head attention and the feed-forward network do most of the work; their PyTorch implementations and details follow.
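One way to sketch these pieces in PyTorch, writing scaled dot-product attention by hand and using a pre-norm residual layout (the original Transformer uses post-norm and ReLU); n_heads, d_ff, and the GELU activation are assumptions chosen for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    """Scaled dot-product attention computed over n_heads parallel subspaces."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projection
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, attn_mask=None):
        batch, seq_len, d_model = x.shape
        # Project to queries, keys, values, then split into heads: (batch, heads, seq, d_head).
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)
        q = q.reshape(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        k = k.reshape(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        v = v.reshape(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        # softmax(Q K^T / sqrt(d_head)) V, with masked positions set to -inf before the softmax.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        if attn_mask is not None:
            scores = scores.masked_fill(attn_mask, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v
        out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out_proj(out)


class FeedForward(nn.Module):
    """Position-wise MLP: expand to d_ff, apply a nonlinearity, project back to d_model."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)


class EncoderBlock(nn.Module):
    """Pre-norm residual block: x + Attn(LN(x)), then x + FFN(LN(x))."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x
```

Stacking N of these blocks and adding a final linear output projection completes the encoder-style model. The decoder-only stack below reuses the same components but restricts attention to earlier positions: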

- 📝 Token Embedding
- 📍 Positional Encoding
- 🔲 Decoder Block
  - Causal Self-Attention
  - Feed-Forward Network
- × N (stack of 6-96 blocks)
- 🎯 LM Head (Next Token)
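A decoder block can reuse the MultiHeadAttention and FeedForward sketches above; the only change is a causal mask, built here with torch.triu, that blocks each position from attending to anything after it (again a pre-norm sketch, not the only possible layout):

```python
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    """Pre-norm residual block whose self-attention sees only current and past positions."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)  # defined in the encoder sketch above
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff)             # defined in the encoder sketch above

    def forward(self, x):
        seq_len = x.size(1)
        # True above the diagonal marks "future" positions; they are filled with -inf in attention.
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        x = x + self.attn(self.ln1(x), attn_mask=causal_mask)
        x = x + self.ffn(self.ln2(x))
        return x
```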


Note how the decoder uses causal masking so that each position can attend only to itself and earlier tokens, never to future ones.
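Putting the pieces together, a minimal decoder-only model might look like the sketch below, reusing TokenEmbedding, PositionalEncoding, and DecoderBlock from the earlier sketches; the default sizes (d_model=512, 8 heads, 6 layers) and the weight tying between the embedding and the LM head are illustrative assumptions, not requirements:

```python
import torch
import torch.nn as nn


class DecoderOnlyTransformer(nn.Module):
    """Token embedding -> positional encoding -> N causal decoder blocks -> LM head."""

    def __init__(self, vocab_size: int, d_model: int = 512, n_heads: int = 8,
                 d_ff: int = 2048, n_layers: int = 6, max_len: int = 4096):
        super().__init__()
        self.embed = TokenEmbedding(vocab_size, d_model)   # from the embedding sketch above
        self.pos = PositionalEncoding(d_model, max_len)
        self.blocks = nn.ModuleList(
            [DecoderBlock(d_model, n_heads, d_ff) for _ in range(n_layers)]
        )
        self.ln_final = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Optional weight tying: the LM head shares the token embedding matrix.
        self.lm_head.weight = self.embed.embed.weight

    def forward(self, token_ids):
        x = self.pos(self.embed(token_ids))        # (batch, seq_len, d_model)
        for block in self.blocks:
            x = block(x)                           # causal attention inside every block
        return self.lm_head(self.ln_final(x))      # (batch, seq_len, vocab_size) next-token logits


# Quick shape check with hypothetical sizes:
model = DecoderOnlyTransformer(vocab_size=50_000)
logits = model(torch.randint(0, 50_000, (2, 16)))  # batch of 2, sequence length 16
print(logits.shape)                                # torch.Size([2, 16, 50000])
```

Because the causal mask hides every later position, the logits at position t depend only on tokens 0…t, which is what allows the LM head to be trained to predict the next token at every position in parallel.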