GLM-5

From Vibe Coding to Agentic Engineering

Learning Objectives

Part 1: The GLM-5 Architecture

Introduction to GLM-5

The Scaling Challenge

GLM-5's Unique Contributions

Part 2: DeepSeek Sparse Attention (DSA)

The Attention Problem

DSA: Origin and Integration

DSA Scoring Formula

```
function DSAScore(x, Q, K, W_weights)
  // x: current token's hidden state
  // Q: indexer queries derived from x; K: cached indexer keys

  // 1. Base dot-product scores against every cached position
  base_scores = Q * K^T / sqrt(d_index)

  // 2. Per-head importance weights derived from the hidden state
  head_weights = x * W_weights / sqrt(H)

  // 3. Per-position score: weighted sum across indexer heads
  final_score = sum_over_heads(head_weights * base_scores)

  // 4. Keep only the top-k highest-scoring positions for full attention
  return top_k(final_score, k=2048)
```
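As a concrete sketch, the scoring above can be written in NumPy. The shapes, the function name `dsa_topk_indices`, and the hyperparameters here are illustrative assumptions for one query token, not GLM-5's actual implementation:

```python
import numpy as np

def dsa_topk_indices(q, keys, head_weights, k):
    """Score cached positions with a lightweight indexer, keep the top-k.

    q:            (H, d) indexer queries for the current token
    keys:         (T, d) cached indexer keys, one per past position
    head_weights: (H,)   per-head importance weights for this token
    """
    H, d = q.shape
    # Per-head dot-product scores, scaled as in standard attention
    per_head = (q @ keys.T) / np.sqrt(d)      # (H, T)
    # Weighted sum across indexer heads -> one score per cached position
    scores = head_weights @ per_head          # (T,)
    k = min(k, keys.shape[0])
    # Indices of the k highest-scoring positions (order not guaranteed)
    return np.argpartition(scores, -k)[-k:]

# Toy usage: 4 indexer heads, 64-dim index space, 100 cached positions
rng = np.random.default_rng(0)
idx = dsa_topk_indices(rng.normal(size=(4, 64)),
                       rng.normal(size=(100, 64)),
                       rng.normal(size=4), k=8)
```

Only the positions in `idx` would then participate in the full attention computation, which is what turns the quadratic attention cost into a roughly linear one per query.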

Part 3: Multi-head Latent Attention (MLA)

MLA: Origin and Purpose

MLA Compression Strategy

MLA Key-Value Projections

```python
# Simplified MLA KV projection path
import torch
from torch import nn

class MLA_KV(nn.Module):
    def __init__(self, hidden, lora_rank, full_dim):
        super().__init__()
        # Down-project to a low-rank latent (e.g., 512)
        self.down = nn.Linear(hidden, lora_rank, bias=False)
        self.norm = nn.RMSNorm(lora_rank)
        # Up-project back to full heads (e.g., 64 heads * 256 dim)
        self.up = nn.Linear(lora_rank, full_dim, bias=False)

    def forward(self, x):
        # Compress; only this tiny latent tensor is cached
        c_kv = self.norm(self.down(x))
        # Dynamically reconstruct full K and V when needed
        full_kv = self.up(c_kv)
        return c_kv, full_kv
```
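The payoff is cache size. Using the example dimensions above (64 heads of dim 256, latent rank 512), a back-of-envelope comparison of per-token cache elements — ignoring the small decoupled RoPE key that MLA also caches — looks like this:

```python
# Back-of-envelope KV-cache comparison, using the example dimensions above
heads, head_dim, lora_rank = 64, 256, 512

full_kv_per_token = heads * head_dim * 2   # separate K and V for every head
mla_kv_per_token = lora_rank               # one shared compressed latent

compression = full_kv_per_token // mla_kv_per_token
```

With these (illustrative) numbers the latent cache is 64x smaller than a standard multi-head KV cache, which is what makes long contexts affordable at inference time.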

Part 4: Mixture of Experts (MoE)

Dense vs. Sparse Processing

GLM-5 Routing Mechanism

GLM-5 MoE Architecture

```python
class GlmMoeDsaMoE(nn.Module):
    def forward(self, hidden_states: torch.Tensor):
        # 1. Router assigns tokens to top-k experts (sigmoid + bias)
        router_logits = self.gate(hidden_states)
        indices, weights = self.route_tokens(router_logits)

        # 2. Process through selected routed experts (sparse)
        routed_out = self.experts(hidden_states, indices, weights)

        # 3. Process through shared expert (dense, every token)
        shared_out = self.shared_experts(hidden_states)

        # 4. Combine outputs
        return routed_out + shared_out
```
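The routing step ("Sigmoid + Bias") can be sketched in NumPy. This is a minimal illustration under assumed shapes — the bias term stands in for the load-balancing bias used during expert selection, and `route_tokens` is a hypothetical helper, not GLM-5's actual routing code:

```python
import numpy as np

def route_tokens(router_logits, expert_bias, top_k):
    """Pick top_k experts per token.

    Selection uses sigmoid affinities plus a per-expert bias;
    the combine weights use the unbiased affinities, renormalized.
    """
    affinity = 1.0 / (1.0 + np.exp(-router_logits))    # (T, E) in (0, 1)
    biased = affinity + expert_bias                    # bias affects selection only
    indices = np.argsort(-biased, axis=-1)[:, :top_k]  # (T, top_k) expert ids
    weights = np.take_along_axis(affinity, indices, axis=-1)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return indices, weights

# Toy usage: 5 tokens routed over 8 experts, 2 experts per token
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 8))
indices, weights = route_tokens(logits, np.zeros(8), top_k=2)
```

Keeping the bias out of the combine weights lets the router rebalance expert load without distorting each token's output mixture.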

Summary

All Interactive Demos

Lecture Summary

Supplementary Resources