Activation Functions & SwiGLU

Interactive exploration of ReLU, GELU, Swish, and the 3-matrix SwiGLU FFN architecture.

📈 Function Comparison

ReLU
f(x) = max(0, x)
Historically the standard. Fast to compute, but it completely blocks negative values, so a neuron whose inputs stay negative receives no gradient and stops learning.
GELU
f(x) = x · Φ(x), where Φ is the standard normal CDF
A smoother, probabilistic curve. Widely adopted in early Transformer models like BERT and GPT-2 for better gradient flow.
Swish / SiLU
f(x) = x · σ(x)
Smooth and slightly non-monotonic (dips below zero). This is the exact mathematical function used inside the highly successful SwiGLU gate.
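The three curves above can be reproduced with a few lines of plain Python (illustrative values only; the exact GELU is used here, not the tanh approximation):

```python
import math

def relu(x):
    # Hard threshold: negative inputs are zeroed entirely.
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x):
    # Swish / SiLU: x * sigmoid(x); dips slightly below zero for negative x.
    return x / (1.0 + math.exp(-x))

for x in (-2.0, -0.5, 0.0, 1.0):
    print(f"x={x:+.1f}  relu={relu(x):+.3f}  gelu={gelu(x):+.3f}  swish={swish(x):+.3f}")
```

Note that at x = −2 ReLU returns exactly 0, while GELU and Swish return small negative values, which is the non-monotonic dip mentioned above.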

⚡ Derivative (Gradient) Comparison


ReLU dead zone
f′(x) = 0 for all x < 0.
Once dead, a neuron never updates again.

Swish at x = −2
f′(−2) ≈ −0.090
Non-zero gradient even for negative inputs, so neurons can always recover and continue learning.
Why Swish/GELU Gradients Matter: During backpropagation, a neural network updates its weights based on the gradient. If a ReLU neuron receives a large negative input, its gradient becomes 0. With no gradient, the weights are never adjusted—the neuron is permanently "dead." Swish and GELU maintain small but non-zero gradients in the negative space, resolving this issue elegantly.
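This contrast is easy to verify numerically. A minimal sketch using the closed-form Swish derivative, d/dx[x·σ(x)] = σ(x)·(1 + x·(1 − σ(x))):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu_grad(x):
    # Zero gradient everywhere below zero: the "dead zone".
    return 1.0 if x > 0 else 0.0

def swish_grad(x):
    # d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

print(relu_grad(-2.0))             # 0.0: no update ever reaches this neuron
print(round(swish_grad(-2.0), 3))  # -0.091: small but non-zero
```

The Swish gradient at x = −2 is about −0.09, matching the figure quoted above: tiny, but enough for backpropagation to keep nudging the weights.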

🔀 The 3-Matrix SwiGLU Architecture

SwiGLU(x) = (Swish(x · W₁) ⊙ (x · W₃)) · W₂
A standard Feed-Forward Network uses 2 matrices (up-projection, down-projection). SwiGLU uses 3 matrices by splitting the up-projection into two parallel paths: a Gate (W₁) and a Value (W₃).

1. Gate projection (W₁): Swish(x · W₁), which decides what to let through.

2. Value projection (W₃): x · W₃, the raw hidden feature data.

⊙ Element-wise multiplication of the two paths yields the gated hidden state: Gate ⊙ Value.

3. Down projection (W₂): the gated result is multiplied by W₂ to produce the final output.
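The three-step flow can be sketched as a forward pass in plain Python (the tiny weight matrices and dimensions here are hypothetical, chosen only to make the shapes visible):

```python
import math

def swish(x):
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    # Multiply an (out_dim x in_dim) matrix by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_ffn(x, W1, W3, W2):
    # 1. Gate path: Swish(x @ W1) decides what to let through.
    gate = [swish(h) for h in matvec(W1, x)]
    # 2. Value path: x @ W3 carries the raw hidden features.
    value = matvec(W3, x)
    # Element-wise gating, then 3. down-projection by W2.
    gated = [g * v for g, v in zip(gate, value)]
    return matvec(W2, gated)

# Toy weights: model dim 2, hidden dim 3 (all values illustrative).
W1 = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
W3 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W2 = [[0.2, -0.1, 0.4], [0.7, 0.3, -0.5]]
print(swiglu_ffn([1.5, -0.5], W1, W3, W2))
```

Because the gate is multiplicative, a near-zero Swish output on one hidden unit suppresses the corresponding value feature regardless of its magnitude.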

Standard FFN (2 Matrices)

FFN(x) = ReLU(x · W₁) · W₂
One projection to size up, a hard threshold, then one projection to size down.

SwiGLU FFN (3 Matrices)

FFN(x) = (Swish(x · W₁) ⊙ (x · W₃)) · W₂
Splits the up-projection into two paths to allow continuous, learnable selective gating before scaling down.
The Expressivity Advantage: Standard FFNs act as basic threshold filters. By decoupling the activation into two distinct matrices (W₁ and W₃), SwiGLU gains the ability to route and modulate information multiplicatively. (Note: to keep overall parameter counts comparable to a standard FFN, architectures that adopt SwiGLU, such as LLaMA, typically shrink the hidden dimension to roughly two-thirds of the conventional size.)
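The parameter-parity note can be checked with quick arithmetic. The model dimension below is illustrative; the 2/3 scaling follows common practice (e.g. LLaMA-style FFNs), and biases are ignored:

```python
def ffn_params(d_model, d_hidden, n_matrices):
    # Each projection matrix holds d_model * d_hidden weights (no biases).
    return n_matrices * d_model * d_hidden

d = 768                      # hypothetical model dimension
h_std = 4 * d                # conventional FFN hidden size (4x expansion)
h_swiglu = (2 * 4 * d) // 3  # shrunk to 2/3 so the third matrix is "free"

standard = ffn_params(d, h_std, 2)    # W1, W2
swiglu = ffn_params(d, h_swiglu, 3)   # W1, W3, W2
print(standard, swiglu)  # identical parameter counts
```

With d = 768 both variants come to 4,718,592 weights, so the extra matrix costs nothing in total parameters.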