
Attention & Transformers

The Architecture That Powers Modern AI


Part 1: The Problem with Sequential Data


Sequential Data is Everywhere


The Old Approach: Process One at a Time


The Breakthrough: Attention Is All You Need


Part 2: The Attention Mechanism


What is Attention?


The Query-Key-Value Framework


Scaled Dot-Product Attention
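
For reference, the formula this slide covers is the standard scaled dot-product attention from "Attention Is All You Need":

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

where d_k is the key dimension; dividing by sqrt(d_k) keeps the dot products from growing so large that the softmax saturates.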


Self-Attention: Attending to Yourself


Self-Attention in PyTorch

import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    """x: (batch, seq_len, d_model)"""
    Q = x @ W_q  # Queries
    K = x @ W_k  # Keys  
    V = x @ W_v  # Values
    
    d_k = K.shape[-1]
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
    attn_weights = F.softmax(scores, dim=-1)
    output = attn_weights @ V
    
    return output, attn_weights
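
A quick usage sketch of the function above; the weight matrices here are just random tensors for illustration:

d_model = 64
x = torch.randn(2, 5, d_model)         # batch=2, seq_len=5
W_q = torch.randn(d_model, d_model)    # toy projection weights
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

out, weights = self_attention(x, W_q, W_k, W_v)
print(out.shape)       # torch.Size([2, 5, 64])
print(weights.shape)   # torch.Size([2, 5, 5]) -- one weight per (query, key) pair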

Part 3: Multi-Head Attention


Why Multiple Heads?


Multi-Head Attention Formula
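
For reference, the standard multi-head formula:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Each head runs scaled dot-product attention on its own learned projections of Q, K, and V; the concatenated head outputs are mixed by the output projection W^O.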


Multi-Head Attention in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
    
    def forward(self, x):
        B, L, D = x.shape
        # Project and reshape to (B, num_heads, L, d_k)
        Q = self.W_q(x).view(B, L, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, L, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, L, self.num_heads, self.d_k).transpose(1, 2)
        
        # Scaled dot-product attention
        scores = (Q @ K.transpose(-2, -1)) / (self.d_k ** 0.5)
        attn = F.softmax(scores, dim=-1)
        out = (attn @ V).transpose(1, 2).reshape(B, L, D)
        
        return self.W_o(out)
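
A quick shape check of the module above (batch size and sequence length are illustrative):

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)    # batch=2, seq_len=10
out = mha(x)
print(out.shape)               # torch.Size([2, 10, 512]) -- same shape as the input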

Part 3.5: PyTorch Built-in Layers


Using nn.MultiheadAttention

import torch
import torch.nn as nn

# Create the layer
mha = nn.MultiheadAttention(
    embed_dim=512,     # Model dimension
    num_heads=8,       # Number of attention heads
    dropout=0.1,       # Dropout rate
    batch_first=True   # Input shape: (batch, seq, dim)
)

# Forward pass (self-attention: Q=K=V)
x = torch.randn(2, 10, 512)  # batch=2, seq=10
output, attn_weights = mha(x, x, x)
# output: (2, 10, 512)
# attn_weights: (2, 10, 10) -- averaged over heads by default

From Scratch vs Built-in: When to Use Each


Complete Comparison

# FROM SCRATCH (what you implement in assignments)
class SelfAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
    
    def forward(self, x):
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = Q @ K.transpose(-2, -1) / (Q.size(-1) ** 0.5)
        return F.softmax(scores, dim=-1) @ V

# BUILT-IN (what you use in production)
self.attn = nn.MultiheadAttention(d_model, num_heads)
output, _ = self.attn(x, x, x)  # Same mechanism, multi-head and optimized

Part 4: Positional Encoding


The Position Problem


Sinusoidal Positional Encoding
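
For reference, the sinusoidal encoding assigns each position pos and dimension pair (2i, 2i+1):

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

A minimal sketch of the PositionalEncoding module that the SimpleTransformer code in Part 7 refers to, assuming an even d_model and a fixed max_len cap:

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)                     # (max_len, 1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)                           # even dimensions
        pe[:, 1::2] = torch.cos(pos * div)                           # odd dimensions
        self.register_buffer("pe", pe)                               # fixed, not trained

    def forward(self, x):
        # x: (batch, seq_len, d_model) -- add the encoding for the first seq_len positions
        return x + self.pe[: x.size(1)]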


Part 5: The Transformer Block


Inside a Transformer Block


Residual Connections & Layer Norm
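
In the pre-norm form used in the Transformer block code a couple of slides ahead, each sub-layer is wrapped as:

x = x + Attention(LayerNorm(x))
x = x + FeedForward(LayerNorm(x))

The residual path lets gradients flow straight through the stack, and LayerNorm keeps activations in a stable range.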


Feed-Forward Network


Transformer Block in PyTorch

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
    
    def forward(self, x):
        # Pre-norm style
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x
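
Blocks like this are stacked to form the full model; a minimal sketch (the layer count and sizes here are illustrative):

import torch
import torch.nn as nn

blocks = nn.Sequential(*[TransformerBlock(d_model=512, num_heads=8, d_ff=2048)
                         for _ in range(6)])
x = torch.randn(2, 10, 512)
out = blocks(x)                # (2, 10, 512) -- the shape is preserved through every block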

Part 6: Encoder vs Decoder


Two Flavors of Transformers


Causal Masking (for Decoders)


Causal Mask Implementation

def causal_attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = K.shape[-1]
    
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
    
    # Create causal mask
    seq_len = x.shape[1]
    mask = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1).bool()
    scores = scores.masked_fill(mask, float('-inf'))
    
    attn = F.softmax(scores, dim=-1)
    return attn @ V
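
The same effect is available with the built-in layer from Part 3.5: pass the upper-triangular mask as attn_mask (for a boolean mask, True marks pairs that are not allowed to attend). This sketch assumes the mha layer and input x from the earlier nn.MultiheadAttention example:

seq_len = x.shape[1]
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
output, _ = mha(x, x, x, attn_mask=causal_mask)   # each position sees only itself and the past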

Part 7: Putting It All Together


Full Transformer Architecture


Why Transformers Dominate


PyTorch's nn.Transformer
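
The next slide uses the encoder-only pieces; for completeness, here is a minimal sketch of the full encoder-decoder nn.Transformer module (the sizes match those used elsewhere in this deck):

import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=512, nhead=8,
    num_encoder_layers=6, num_decoder_layers=6,
    batch_first=True,
)
src = torch.randn(2, 10, 512)   # encoder input (already embedded)
tgt = torch.randn(2, 7, 512)    # decoder input (already embedded)
out = model(src, tgt)           # (2, 7, 512) -- one vector per target position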


Building a Transformer in PyTorch

import torch.nn as nn

class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos_enc = PositionalEncoding(d_model)  # You implement this!
        
        # Use built-in TransformerEncoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        self.fc_out = nn.Linear(d_model, vocab_size)
    
    def forward(self, x):
        x = self.embed(x)
        x = self.pos_enc(x)
        x = self.transformer(x)
        return self.fc_out(x)
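
A quick forward pass through the model above; the vocabulary size and token IDs are made up for illustration, and PositionalEncoding is assumed to be the sinusoidal sketch from Part 4:

import torch

model = SimpleTransformer(vocab_size=10000)
tokens = torch.randint(0, 10000, (2, 10))   # batch=2, seq_len=10 token IDs
logits = model(tokens)
print(logits.shape)                         # torch.Size([2, 10, 10000]) -- one logit per vocab entry per position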

All Interactive Demos


Summary


Key Takeaways


Next Lecture: NLP & Tokenization