🔀 The 3-Matrix SwiGLU Architecture
SwiGLU(x) = (Swish(x · W₁) ⊙ (x · W₃)) · W₂
A standard Feed-Forward Network uses 2 matrices (up-projection, down-projection). SwiGLU uses 3 matrices by splitting the up-projection into parallel paths: a Gate (W₁) and a Value (W₃).
1. GATE PROJECTION (W₁): Swish(x · W₁). Decides what to let through.
2. VALUE PROJECTION (W₃): x · W₃. The raw hidden feature data.
GATED RESULT: Gate ⊙ Value (element-wise multiply), which yields the gated hidden state.
3. DOWN PROJECTION (W₂): (Gate ⊙ Value) · W₂ (matrix multiplication).
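The three steps above can be sketched for a single input vector. This is a minimal illustration, not a production implementation: it uses plain Python lists so it runs without dependencies, assumes Swish with β = 1 (often called SiLU), omits bias terms, and the helper names (`swish`, `matvec`, `swiglu_ffn`) and the toy weight values are hypothetical.

```python
import math

def swish(z):
    """Swish with beta = 1 (assumed here): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x_i * row[j] for x_i, row in zip(x, W))
            for j in range(len(W[0]))]

def swiglu_ffn(x, W1, W3, W2):
    gate = [swish(g) for g in matvec(x, W1)]      # 1. gate projection
    value = matvec(x, W3)                         # 2. value projection
    gated = [g * v for g, v in zip(gate, value)]  # element-wise multiply
    return matvec(gated, W2)                      # 3. down projection

# Toy example (made-up weights): model dimension 2, hidden dimension 3.
x = [1.0, -1.0]
W1 = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]    # gate weights, shape 2 x 3
W3 = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]    # value weights, shape 2 x 3
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # down weights, shape 3 x 2
y = swiglu_ffn(x, W1, W3, W2)
print(y)
```

Note that W₁ and W₃ share the same shape, so the gate and value vectors line up element by element before W₂ maps the result back to the model dimension.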
Standard FFN (2 Matrices)
FFN(x) = ReLU(x · W₁) · W₂
Uses one projection to expand the dimension, applies a hard threshold (ReLU zeroes every negative value), and uses a second projection to shrink it back down.
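For comparison, the two-matrix version might look like the sketch below. It makes the same simplifying assumptions: plain Python lists, a single input vector, no bias terms, and hypothetical helper names and toy weights.

```python
def relu(z):
    """Hard threshold: negative values become zero."""
    return max(0.0, z)

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x_i * row[j] for x_i, row in zip(x, W))
            for j in range(len(W[0]))]

def relu_ffn(x, W1, W2):
    hidden = [relu(h) for h in matvec(x, W1)]  # up-projection + threshold
    return matvec(hidden, W2)                  # down-projection

# Toy example (made-up weights): model dimension 2, hidden dimension 3.
x = [1.0, -1.0]
W1 = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]    # up weights, shape 2 x 3
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # down weights, shape 3 x 2
y = relu_ffn(x, W1, W2)
print(y)  # [2.0, 1.0]
```

Here x · W₁ is [1, -1, 1]; ReLU zeroes the middle unit entirely, so that hidden feature contributes nothing to the output.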
SwiGLU FFN (3 Matrices)
FFN(x) = (Swish(xW₁) ⊙ xW₃) · W₂
Splits the up-projection into a gate and a value path, allowing smooth, learnable, selective gating before projecting back down.
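The Expressivity Advantage: A standard ReLU FFN acts as a basic threshold filter: each hidden unit is either passed through unchanged or zeroed. By splitting the up-projection into two distinct matrices (W₁ and W₃), SwiGLU gains the ability to route and modulate information multiplicatively, with the gate scaling each value feature by a smooth, learned amount. (Note: the third matrix adds parameters, so to keep the overall parameter count roughly equal to that of a standard FFN, architectures typically shrink the hidden dimension to about 2/3 of its original size when adopting SwiGLU.)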
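The parameter trade-off in the note can be checked with simple arithmetic. The sketch below assumes bias-free layers, a model dimension of 4096, and a baseline hidden dimension of 4 × 4096; these sizes and the function names are illustrative choices, not taken from any specific model.

```python
def ffn_params(d_model, d_hidden):
    """Two matrices: d_model x d_hidden and d_hidden x d_model."""
    return 2 * d_model * d_hidden

def swiglu_params(d_model, d_hidden):
    """Three matrices: gate, value, and down projection."""
    return 3 * d_model * d_hidden

def matched_hidden(d_hidden):
    """Hidden size that makes 3 matrices cost about as much as 2."""
    return (2 * d_hidden) // 3

d_model = 4096             # assumed model dimension
d_ff = 4 * d_model         # assumed baseline hidden dimension (16384)
d_swiglu = matched_hidden(d_ff)

print(ffn_params(d_model, d_ff))          # 134217728
print(swiglu_params(d_model, d_ff))       # 201326592 (50% more, unadjusted)
print(d_swiglu)                           # 10922
print(swiglu_params(d_model, d_swiglu))   # 134209536
```

With the hidden dimension cut to two thirds, the SwiGLU block lands within a few thousand parameters of the two-matrix baseline. In practice the reduced size is often rounded to a hardware-friendly multiple.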