
Activation Functions in Neural Networks

Activation functions play a crucial role in neural networks by introducing non-linearity, enabling the network to learn complex patterns. While the sigmoid function has traditionally been popular, especially for binary classification, it's not always the best choice for every layer: its gradients shrink toward zero for large inputs, which slows learning (the vanishing gradient problem). There are alternatives, though.

Activation Functions

Linear Function

Linear activation functions don't really buy you anything. Their derivative is a constant, so backpropagation has nothing useful to learn from, and whatever hidden layers you add, the output stays linear: the whole network collapses to a single linear transformation, much like plain linear or logistic regression. So it's useless for most complex problems.

You might use a linear activation function in one place: the output layer, when the output is a real number (a regression problem). Even then, if the output value is non-negative you could use ReLU instead. A quick sketch of the collapse follows.
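
Here is a small sketch (with arbitrary random weights, purely for illustration) of why stacking linear layers collapses into a single linear layer:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 2))                               # one input with 2 features
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)      # layer 1: 2 -> 3, linear activation
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)      # layer 2: 3 -> 1, linear activation

two_layer = (x @ W1.T + b1) @ W2.T + b2

# The same mapping expressed as a single linear layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = x @ W.T + b

print(np.allclose(two_layer, one_layer))  # True: the "deep" linear network is just one linear layer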

Binary Step Function

It's on or off, so it can't handle multi-class classification - it's binary, after all. And the vertical slope at the threshold doesn't work well with calculus: the derivative is zero everywhere else and undefined at the jump, so gradient descent has nothing to work with.

[Figure: binary step activation function]
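
For reference, a one-line NumPy sketch of the step function (a threshold at zero is assumed):

A = np.where(z >= 0, 1, 0)  # Binary step: 1 at or above the threshold, 0 below it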

Non-Linearity

Instead we need non-linear activation functions. These can create complex mappings between inputs and outputs, allow backpropagation (because they have a useful derivative), and make multiple layers worthwhile (linear functions degenerate to a single layer). The main options follow.

Sigmoid Function

The sigmoid function squashes its input to a range between 0 and 1, making it suitable for binary classification:

import numpy as np

A = 1 / (1 + np.exp(-z))  # Sigmoid activation, where z is the input matrix

Its output is ideal for probabilities, as you can categorize predictions into two classes based on a threshold, typically 0.5.
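
For example, applying the 0.5 threshold to the activations above:

predictions = (A > 0.5).astype(int)  # class 1 if the predicted probability exceeds 0.5, else class 0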

[Figure: sigmoid activation function]

Tanh Function

The tanh function is a scaled version of the sigmoid, with outputs ranging from -1 to 1:

# Tanh activation using explicit calculation
A = (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))
 
# Tanh activation using NumPy's built-in function
A = np.tanh(z)

Tanh centers data around zero for the next layer, which can benefit learning in hidden layers. However, both tanh and sigmoid suffer from the vanishing gradient problem when inputs become large in magnitude, because the slope becomes near-zero. Tanh is generally preferred over sigmoid for hidden layers.
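
A small sketch of why this matters - the gradients of both functions shrink toward zero as the input grows in magnitude (printed values are approximate):

zs = np.array([0.0, 2.0, 5.0, 10.0])
s = 1 / (1 + np.exp(-zs))
print(s * (1 - s))            # sigmoid gradient: 0.25, 0.105, 0.0066, 0.000045
print(1 - np.tanh(zs) ** 2)   # tanh gradient:    1.0,  0.071, 0.00018, 0.0000000082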

[Figure: tanh activation function]

ReLU Function

ReLU, short for Rectified Linear Unit, addresses some of the issues of sigmoid and tanh. It provides a linear response for positive inputs and zero for negative inputs:

RELU = np.maximum(0, z)  # ReLU activation

This function accelerates convergence during gradient descent because it doesn't saturate for positive values.

Leaky ReLU Function

Leaky ReLU is a variation that allows for a small, non-zero gradient when the input is negative:

Leaky_RELU = np.maximum(0.01 * z, z)  # Leaky ReLU activation, with 0.01 as the small slope for z < 0

Parametric ReLU (PReLU)

PReLU is ReLU, but the slope in the negative part is learned via backpropagation: as the network learns the weights between nodes, it also learns the optimal slope for the negative portion of the activation function. The downside is that this is more complicated and computationally intensive, so you give up some of ReLU's simplicity, and your mileage may vary - some kinds of problems benefit from it more than others. A sketch follows.
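
A rough NumPy sketch of the PReLU forward pass; alpha is shown here with an assumed initial value, but in a real network it is a parameter updated by backpropagation alongside the weights:

alpha = 0.25                        # learnable slope for z < 0 (initial value assumed here)
A = np.where(z > 0, z, alpha * z)   # PReLU activation

# During backpropagation, alpha gets its own gradient:
# dL/dalpha = sum of (upstream gradient * z) over the positions where z < 0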

Exponential Linear Unit (ELU)

[Figure: ELU activation function]

This behaves like ReLU for positive inputs but uses the exponential function on the negative side, giving a smooth curve that can output small negative values.
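
A minimal NumPy sketch, assuming the standard formulation with a hyperparameter alpha (commonly 1.0):

alpha = 1.0
A = np.where(z > 0, z, alpha * (np.exp(z) - 1))  # ELU: linear for z > 0, exponential curve for z <= 0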

Swish

Swish was discovered using automatic search methods, specifically with the goal of finding new activation functions that perform well. In those experiments it generally performed as well as or better than ReLU in deep learning models - including tasks like machine translation.

Swish is, essentially, a smooth function that interpolates non-linearly between a linear function and ReLU. This interpolation is controlled by Swish's parameter β, which can be trainable. Swish is similar to ReLU in some ways - especially as β increases - but like GELU, it is differentiable at zero.
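
A minimal NumPy sketch of Swish, with β treated as a fixed value here (in practice it can be a trainable parameter; β = 1 gives the SiLU variant):

beta = 1.0
A = z / (1 + np.exp(-beta * z))  # Swish: z * sigmoid(beta * z)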

Gaussian Error Linear Unit (GELU)

If you combine the effects of ReLU, zoneout, and dropout, you get GELU. One of ReLU's limitations is that it's non-differentiable at zero - GELU resolves this issue, and it often yields higher test accuracy than ReLU and ELU. GELU is now quite popular, and is the activation function used by OpenAI in their GPT series of models. It weights the input x by the standard Gaussian CDF Φ(x):

xP(X \leq x) = x\Phi(x) = \frac{1}{2}x\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]

The error function is a mathematical function that arises in probability, statistics, and partial differential equations describing diffusion. It is defined as:

\text{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2} \, dt
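
A sketch of both the exact form and the widely used tanh approximation (the exact form assumes SciPy is available for a vectorized erf):

from scipy.special import erf

A_exact = 0.5 * z * (1 + erf(z / np.sqrt(2)))  # exact GELU via the Gaussian CDF

# Common tanh-based approximation, handy when erf is unavailable or too slow
A_approx = 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))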

GELU has a smoother, more continuous shape than the ReLU function, which can make it more effective at learning complex patterns in the data.

ReLU and GELU are both continuous and cheap to compute, which makes them easy to optimize during training (GELU is differentiable everywhere, while ReLU is differentiable everywhere except zero). However, there are some key differences between the two functions.

One of the main differences between the ReLU and GELU functions is their shape. ReLU is piecewise linear: it outputs 0 for negative input values and the input value itself for positive input values, with a sharp kink at zero. In contrast, GELU is a smooth curve that dips slightly below zero for small negative inputs before flattening out, like a smoothed ReLU. This difference in shape can affect the way the two functions behave in different situations.

Another key difference between the ReLU and GELU functions is their behavior when the input values are close to or below 0. ReLU outputs 0, with zero gradient, for any negative input, which can make it difficult for the network to learn in these regions (the "dying ReLU" problem). In contrast, GELU has a non-zero gradient at x = 0 and small non-zero outputs just below it, which allows the network to keep learning in this region. This can make GELU more effective at learning complex patterns in the data.

[Figure: visualization of the Gaussian Error Linear Unit (GELU)]

Choosing Activation Functions

A general guideline is to use sigmoid for the output layer if your task is binary classification and ReLU for hidden layers. However, you should experiment with different activation functions to find the best fit for your specific problem.
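
As a sketch of that guideline in plain NumPy (layer sizes and weights here are illustrative, not prescriptive):

import numpy as np

def forward(X, W1, b1, W2, b2):
    Z1 = X @ W1 + b1
    A1 = np.maximum(0, Z1)        # ReLU in the hidden layer
    Z2 = A1 @ W2 + b2
    A2 = 1 / (1 + np.exp(-Z2))    # sigmoid in the output layer for binary classification
    return A2

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))                       # 4 examples, 3 features
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)     # hidden layer: 3 -> 5
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)     # output layer: 5 -> 1
probs = forward(X, W1, b1, W2, b2)                # probabilities in (0, 1)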

Need for Non-linear Activation Functions

Without non-linear activation functions, a neural network would behave like a single-layer logistic regression model, no matter how many layers it had. This linearity limits the network's ability to capture complex patterns. However, for regression tasks where the output is a real number, a linear activation function might be used in the output layer. If the output is non-negative, ReLU might be a better choice.

Derivatives of Activation Functions

In backpropagation, we compute the derivatives of these activation functions.

Sigmoid Derivative

For the sigmoid function g(z) = \frac{1}{1 + e^{-z}}, its derivative is:

g'(z) = g(z) \cdot (1 - g(z))

Tanh Derivative

For the tanh function g(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}, its derivative is:

g'(z) = 1 - \tanh(z)^2 = 1 - g(z)^2

ReLU Derivative

For the ReLU function g(z) = \max(0, z), its derivative is:

g'(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}

Leaky ReLU Derivative

For the leaky ReLU function g(z) = \max(0.01z, z), its derivative is:

g'(z) = \begin{cases} 0.01 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}

These derivatives are essential for the learning process, allowing the neural network to update the weights in the direction that reduces the loss.
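
The four derivatives above translate directly into NumPy (a sketch; the value used at exactly z = 0 for ReLU and leaky ReLU is a convention):

import numpy as np

def sigmoid_grad(z):
    g = 1 / (1 + np.exp(-z))
    return g * (1 - g)                   # g'(z) = g(z)(1 - g(z))

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2           # g'(z) = 1 - tanh(z)^2

def relu_grad(z):
    return np.where(z >= 0, 1.0, 0.0)    # 1 for z >= 0, 0 otherwise

def leaky_relu_grad(z, slope=0.01):
    return np.where(z >= 0, 1.0, slope)  # 1 for z >= 0, 0.01 otherwise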

