Interactive exploration of ReLU, GELU, Swish, and the SwiGLU gating mechanism
Function Comparison
ReLU
f(x) = max(0, x)
Simple, but causes "dead neurons": zero gradient for all x < 0
GELU
f(x) = x · Φ(x)
Smooth, probabilistic gating. Used in BERT and GPT-2
Swish / SiLU
f(x) = x · σ(x)
Smooth, non-monotonic. Used inside SwiGLU in most modern LLMs (e.g., LLaMA, PaLM)
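To make the comparison concrete, here is a minimal Python sketch that evaluates the three definitions above at a few points (function names are illustrative; the exact GELU uses the standard normal CDF via math.erf):

```python
import math

def relu(x):
    # f(x) = max(0, x)
    return max(0.0, x)

def gelu(x):
    # exact GELU: f(x) = x * Phi(x), with Phi the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x):
    # Swish / SiLU: f(x) = x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):+.3f}  gelu={gelu(x):+.3f}  swish={swish(x):+.3f}")
```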
Derivative Comparison
ReLU Dead Zone
0.000
Gradient for all x < 0. Once dead, a neuron never recovers
Swish at x = −2
−0.090
Non-zero gradient even for negative inputs. Neurons can always recover
Why this matters: In deep networks, if a ReLU neuron's pre-activation is pushed strongly negative (e.g., by a bad weight update), its gradient becomes exactly 0. With zero gradient, the weights feeding it can never update, so the neuron is permanently "dead." Swish and GELU have small but non-zero gradients everywhere, so neurons can always recover.
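The two numbers above follow directly from the closed-form derivatives; here is a small sketch to reproduce them (helper names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu_grad(x):
    # d/dx max(0, x): exactly 0 for x < 0 -- the "dead zone"
    return 0.0 if x < 0 else 1.0

def swish_grad(x):
    # d/dx [x * sigmoid(x)] = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

print(relu_grad(-2.0))   # 0.0       -> no learning signal, the neuron stays dead
print(swish_grad(-2.0))  # ~ -0.0908 -> small but non-zero, the neuron can still recover
```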
SwiGLU Gating Mechanism
SwiGLU(x) = (Swish(xW₁) ⊙ xW₂) · W₃, where the Swish(xW₁) gate controls which features pass through
Key insight: ReLU is a hard switch: features are either fully on or fully off. SwiGLU uses the Swish gate as a smooth dimmer: each hidden feature has an independent, learnable gate that controls how much of the corresponding value passes through. This gives the FFN much richer, more expressive control over information flow.
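As a rough sketch of how this gating looks in code, here is a SwiGLU feed-forward block in PyTorch following the three-matrix form above (class and dimension names are illustrative, not taken from a specific library):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Feed-forward block: (Swish(x W1) ⊙ x W2) · W3."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w2 = nn.Linear(d_model, d_hidden, bias=False)  # value projection
        self.w3 = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.silu(self.w1(x))     # Swish/SiLU gate: a smooth, per-feature "dimmer"
        value = self.w2(x)            # candidate features
        return self.w3(gate * value)  # element-wise gating, then project back to d_model

# Example: a (batch, seq, d_model) activation passes through with its shape unchanged.
ffn = SwiGLUFFN(d_model=512, d_hidden=1376)
y = ffn(torch.randn(2, 16, 512))      # -> shape (2, 16, 512)
```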