$\color{black}\rule{365px}{3px}$

The Gaussian Error Linear Unit (GELU) is an activation function introduced by Dan Hendrycks and Kevin Gimpel in the paper “Gaussian Error Linear Units (GELUs)”. It has become popular in modern neural networks (e.g., many Transformer-based models such as BERT use GELU) thanks to its smoothness and strong empirical performance.

Definition


The GELU activation is defined as:

$$ \mathrm{GELU}(x) = x\,P(X \le x) = x\,\Phi(x) = x \cdot \frac{1}{2}\left[1 + \mathrm{erf}\!\left(x/\sqrt{2}\right)\right] $$

where $X \sim \mathcal{N}(0, 1)$ and $\Phi$ is the standard normal cumulative distribution function.

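To make the definition concrete, here is a minimal sketch in plain Python using `math.erf` (the function name `gelu_exact` is my own, chosen for illustration):

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), with the standard normal CDF written via erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(gelu_exact(1.0))   # ~0.8413  (= 1 * Phi(1))
print(gelu_exact(-1.0))  # ~-0.1587 (= -1 * Phi(-1))
```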

Approximation


If greater feedforward speed is worth the cost of exactness, GELU can be approximated by

$$ 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\,(x + 0.044715x^3)\right]\right) $$

or

$$ x\,\sigma(1.702x) $$

where $\sigma$ is the logistic sigmoid function.
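For a quick comparison, the following sketch (plain Python; the function names are illustrative, not from the paper) evaluates both approximations against the exact form at a few points:

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU for reference: x * Phi(x), via erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

def gelu_sigmoid(x: float) -> float:
    """Sigmoid approximation: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.4f}  "
          f"tanh={gelu_tanh(x):+.4f}  sigmoid={gelu_sigmoid(x):+.4f}")
```

In practice the tanh form tends to track the exact values more closely, while the sigmoid form trades a little accuracy for an even cheaper computation.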

How the Authors Came Up With GELU