Mixture of Experts (GLM-5)

Sigmoid Routing, Bias Correction, and Shared Experts

📝 Input Tokens

GLM-5 Routing Steps:

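1. Sigmoid(logits): Scores each expert independently. Every score lies in (0, 1), and the scores need not sum to 1.
2. + Bias: Adds a per-expert load-balancing bias to the scores. The biased scores are used only to choose experts.
3. Top-K: Selects the 2 highest-scoring of the 8 experts shown here (in real GLM-5: 8 of 256).
4. Weights: Weights each selected expert's output by its original sigmoid score, without the bias.
5. Shared Expert: Its output is ALWAYS added to the final output.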
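
Steps 1 to 4 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not GLM-5's actual implementation: the function names `sigmoid` and `route`, the logits, and the bias values are all hypothetical, and the gating weights are left unnormalized because the steps above do not say whether they are rescaled.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def route(logits, bias, k=2):
    """Return (selected expert indices, gating weights) for one token."""
    # Step 1: independent sigmoid score per expert, each in (0, 1).
    scores = [sigmoid(z) for z in logits]
    # Step 2: add the load-balancing bias; used for selection only.
    biased = [s + b for s, b in zip(scores, bias)]
    # Step 3: pick the k experts with the highest biased scores.
    order = sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)
    selected = order[:k]
    # Step 4: weight outputs by the raw (unbiased) sigmoid scores.
    weights = [scores[i] for i in selected]
    return selected, weights

# Hypothetical router logits for one token over 8 experts.
logits = [2.0, 1.0, 0.5, -1.0, 0.0, -0.5, 1.5, -2.0]

# Hypothetical bias: expert 2 is treated as under-used, so its bias is raised.
bias = [0.0] * 8
bias[2] = 0.3

selected, weights = route(logits, bias)
unbiased_selected, _ = route(logits, [0.0] * 8)
```

With these made-up numbers, the bias lifts expert 2 into the top 2 in place of expert 6, yet the weight applied to expert 2's output is still its raw sigmoid score. The bias changes which experts run, not how much each one contributes.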

Shared Expert (Always On)

Processes every token, regardless of routing, to capture general knowledge and maintain context continuity. It bypasses the router entirely.
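
A rough sketch of how the shared expert's output might be combined with the routed experts' outputs is shown here. It assumes the combination is a plain element-wise sum. The function `moe_layer` and the toy experts (each just scales its input) are hypothetical stand-ins for real feed-forward networks.

```python
def moe_layer(x, shared_expert, routed_experts, selected, weights):
    """Combine the shared expert with the selected routed experts for one token."""
    # The shared expert runs on every token, independent of the router.
    out = shared_expert(x)
    # Add the weighted output of each selected routed expert.
    for i, w in zip(selected, weights):
        y = routed_experts[i](x)
        out = [o + w * v for o, v in zip(out, y)]
    return out

# Toy experts (hypothetical): each one scales the input by a fixed factor.
shared = lambda x: [1.0 * v for v in x]
routed = [lambda x, s=s: [s * v for v in x] for s in (2.0, 3.0, 4.0)]

token = [1.0, -1.0]
result = moe_layer(token, shared, routed, selected=[0, 2], weights=[0.5, 0.25])
shared_only = moe_layer(token, shared, routed, selected=[], weights=[])
```

Even when no routed expert is selected, `shared_only` still carries the shared expert's output, which is the sense in which it is always on.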

Routed Experts (Top-2 Selected)

Sigmoid Router Scores

Raw Sigmoid Score
With Bias (Used for Selection)