Activation Functions & SwiGLU

Interactive exploration of ReLU, GELU, Swish, and the SwiGLU gating mechanism

πŸ“ˆ Function Comparison

ReLU
f(x) = max(0, x)
Simple but causes "dead neurons" β€” zero gradient for all x < 0
GELU
f(x) = x Β· Ξ¦(x)
Smooth, probabilistic gating (Ξ¦ is the standard normal CDF). Used in BERT and GPT-2
Swish / SiLU
f(x) = x Β· Οƒ(x)
Smooth, non-monotonic. Used inside SwiGLU in most modern LLMs (e.g. LLaMA, PaLM, Mistral)
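A minimal NumPy sketch of the three functions, assuming SciPy is available for the standard normal CDF Ξ¦; the sample inputs are arbitrary.

```python
import numpy as np
from scipy.stats import norm  # provides the standard normal CDF Phi(x)

def relu(x):
    # f(x) = max(0, x): hard cutoff, zero output (and zero gradient) for x < 0
    return np.maximum(0.0, x)

def gelu(x):
    # f(x) = x * Phi(x): input weighted by the probability a standard normal falls below it
    return x * norm.cdf(x)

def swish(x):
    # f(x) = x * sigma(x): input weighted by a sigmoid gate (also called SiLU)
    return x / (1.0 + np.exp(-x))

xs = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
for name, f in [("ReLU", relu), ("GELU", gelu), ("Swish", swish)]:
    print(f"{name:5s}", np.round(f(xs), 3))
```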

⚑ Derivative Comparison


ReLU dead zone: gradient = 0.000 for all x < 0. Once dead, a neuron never recovers.

Swish at x = βˆ’2: gradient β‰ˆ βˆ’0.090, non-zero even for negative inputs. Neurons can always recover.
Why this matters: In deep networks, if a ReLU neuron receives a large negative input (e.g., from a bad weight update), its gradient becomes exactly 0. With zero gradient, the weights can never update β€” the neuron is permanently "dead." Swish and GELU have small but non-zero gradients everywhere, so neurons can always recover.
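A short numerical sketch of the two gradients; the swish_grad formula follows from differentiating x Β· Οƒ(x), and the sample points are arbitrary.

```python
import numpy as np

def relu_grad(x):
    # dReLU/dx: 1 for x > 0, exactly 0 for x < 0 (the dead zone)
    return (x > 0).astype(float)

def swish_grad(x):
    # dSwish/dx = sigma(x) * (1 + x * (1 - sigma(x))): small but never exactly zero
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 + x * (1.0 - s))

xs = np.array([-6.0, -2.0, -0.5, 0.5, 2.0])
print("ReLU  grad:", relu_grad(xs))                 # all zeros left of 0
print("Swish grad:", np.round(swish_grad(xs), 3))   # approx -0.09 at x = -2
```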

πŸ”€ SwiGLU Gating Mechanism

SwiGLU(x) = (Swish(xW₁) βŠ™ xW₃) Β· Wβ‚‚ β€” the gate controls which features pass through

GATE: Swish(x Β· W₁), which features to open
VALUE: x Β· W₃, what information is available
OUTPUT: gate βŠ™ value, the gated result
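To make the βŠ™ step concrete, here is a tiny sketch with made-up gate and value activations (no real weight matrices involved): each feature of the value is scaled by its own Swish gate.

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))

# Hypothetical 4-dim hidden activations for one token (made-up numbers)
gate_pre = np.array([2.0, -0.5, 0.1, -3.0])   # x . W1, before the Swish gate
value    = np.array([1.0,  1.0, 1.0,  1.0])   # x . W3, the candidate information

gate   = swish(gate_pre)    # per-feature "dimmer", not a binary switch
output = gate * value       # elementwise gating: each value scaled by its own gate
print(np.round(gate, 3))    # [ 1.762 -0.189  0.052 -0.142]
print(np.round(output, 3))
```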

Standard FFN (ReLU)

FFN(x) = ReLU(x Β· W₁) Β· Wβ‚‚, a hard on/off switch with no separate gate

SwiGLU FFN

SwiGLU(x) = (Swish(x Β· W₁) βŠ™ x Β· W₃) Β· Wβ‚‚, smooth and selective gating per feature
Key insight: ReLU is a hard switch β€” features are either fully on or fully off. SwiGLU uses the Swish gate as a smooth dimmer: each hidden feature has an independent, learnable gate that controls how much of the corresponding value passes through. This gives the FFN much richer, more expressive control over information flow.
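A minimal sketch of the two FFN variants, following the W₁/Wβ‚‚/W₃ naming above; the dimensions and randomly initialized weights are placeholders, not taken from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16          # toy sizes

# Randomly initialized weights just to make the sketch runnable
W1 = rng.normal(0, 0.02, (d_model, d_hidden))   # gate projection
W3 = rng.normal(0, 0.02, (d_model, d_hidden))   # value ("up") projection
W2 = rng.normal(0, 0.02, (d_hidden, d_model))   # down projection

def swish(x):
    return x / (1.0 + np.exp(-x))

def ffn_relu(x):
    # Standard FFN: ReLU(x W1) W2, a single hidden projection with hard on/off
    return np.maximum(0.0, x @ W1) @ W2

def ffn_swiglu(x):
    # SwiGLU FFN: (Swish(x W1) * x W3) W2, gate and value computed separately
    return (swish(x @ W1) * (x @ W3)) @ W2

x = rng.normal(size=(1, d_model))               # one token's hidden state
print(ffn_relu(x).shape, ffn_swiglu(x).shape)   # both (1, 8)
```

Note that SwiGLU adds a third weight matrix; in practice the hidden width is often reduced (e.g. to about two thirds of the usual 4Β·d_model) so the parameter count stays comparable to a standard FFN.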