Overview
Direct Answer
The Rectified Linear Unit (ReLU) is an activation function defined by f(x) = max(0, x): positive inputs pass through unchanged, whilst negative inputs are set to zero. Its simplicity and low computational cost have made it one of the most widely used activation functions in deep neural networks.
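As a minimal sketch of the definition f(x) = max(0, x), the following uses only the Python standard library. The function names relu and relu_vector are illustrative choices, not taken from any particular framework.

```python
def relu(x: float) -> float:
    """Return max(0, x): positive inputs pass through, negatives become 0."""
    return max(0.0, x)


def relu_vector(values: list[float]) -> list[float]:
    """Apply ReLU element-wise to a list of values."""
    return [relu(v) for v in values]


activations = relu_vector([-2.0, -0.5, 0.0, 1.5, 3.0])
print(activations)  # [0.0, 0.0, 0.0, 1.5, 3.0]
```

In practice, deep learning libraries provide a vectorised equivalent that runs on whole tensors at once; the loop here is only meant to make the rule explicit.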
How It Works
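ReLU operates element-wise on each neuron's pre-activation (the weighted sum of its inputs plus a bias), introducing non-linearity through a piecewise linear function with a hard threshold at zero. During backpropagation, gradients pass unattenuated through positive regions (gradient = 1), whilst negative regions contribute no gradient signal (gradient = 0); the derivative is undefined at exactly zero, where implementations conventionally use 0. Because sigmoid and tanh saturate, their derivatives shrink towards zero for inputs of large magnitude, whereas ReLU's derivative stays at 1 for every positive input, which typically makes deep networks train faster.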
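The gradient behaviour described above can be sketched by comparing the local derivative of ReLU with that of the sigmoid. This is an illustrative example only: the names relu_grad, sigmoid_grad and backward are hypothetical helpers, and treating the ReLU derivative at exactly zero as 0 is an assumed convention.

```python
import math
from typing import Callable


def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for x > 0, else 0 (0 at x == 0 by assumed convention)."""
    return 1.0 if x > 0 else 0.0


def sigmoid_grad(x: float) -> float:
    """Derivative of the logistic sigmoid, s(x) * (1 - s(x)); never exceeds 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def backward(
    pre_activations: list[float],
    upstream_grads: list[float],
    grad_fn: Callable[[float], float],
) -> list[float]:
    """Chain rule for an element-wise activation: upstream gradient times local derivative."""
    return [g * grad_fn(z) for z, g in zip(pre_activations, upstream_grads)]


z = [-3.0, -0.1, 0.5, 4.0]
upstream = [1.0, 1.0, 1.0, 1.0]

print(backward(z, upstream, relu_grad))     # [0.0, 0.0, 1.0, 1.0]
print(backward(z, upstream, sigmoid_grad))  # every entry is at most 0.25
```

With ReLU, the upstream gradient reaches the earlier layer unchanged wherever the pre-activation is positive, whereas the sigmoid scales it down at every layer it passes through.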
Why It Matters
Evaluating ReLU requires only a comparison with zero, with no exponentials or divisions, so it adds little computational overhead in large-scale neural networks and supports fast training and inference on both GPUs and CPUs. Its strong empirical results on image classification, natural language processing, and reinforcement learning tasks have made it a standard choice for practitioners optimising model performance and training speed.
Common Applications
ReLU is the usual activation in convolutional neural networks for computer vision and was used in the feed-forward layers of the original Transformer architecture. Recurrent architectures such as LSTMs rely mainly on sigmoid and tanh gates, and many recent large language models use GELU or gated variants in place of ReLU. It remains a common default for hidden layers in image recognition models and in perception systems for autonomous vehicles.
Key Considerations
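The 'dying ReLU' problem occurs when a neuron's pre-activation is negative for every input, so it always outputs zero; because its gradient is then also zero, its weights stop updating and the neuron may never recover, reducing the network's effective capacity. Leaky ReLU mitigates this by keeping a small non-zero slope for negative inputs, and smooth alternatives such as GELU likewise pass some gradient for negative inputs, at a modest extra computational cost.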
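A minimal sketch of Leaky ReLU, together with a simple check for a unit that is inactive on a given batch, is shown below. The slope value 0.01 is a commonly used default assumed here for illustration, and is_dead is a hypothetical helper, not a function from any library.

```python
def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Leaky ReLU: x for x > 0, otherwise negative_slope * x."""
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x: float, negative_slope: float = 0.01) -> float:
    """Derivative of Leaky ReLU: 1 for x > 0, otherwise negative_slope."""
    return 1.0 if x > 0 else negative_slope


def is_dead(pre_activations: list[float]) -> bool:
    """True if a plain ReLU unit would output zero for every input in this batch."""
    return all(z <= 0 for z in pre_activations)


# Pre-activations of one unit across a small batch (made-up values).
batch = [-4.0, -2.5, -0.3]

print(is_dead(batch))                       # True
print([leaky_relu_grad(z) for z in batch])  # [0.01, 0.01, 0.01]
```

With plain ReLU the gradient for this unit would be zero on every example in the batch, so its weights would not move; the small slope in Leaky ReLU keeps a gradient signal available.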
More in Deep Learning
Deep Learning
Architectures: A subset of machine learning using neural networks with multiple layers to learn hierarchical representations of data.
Mixture of Experts
Architectures: An architecture where different specialised sub-networks (experts) are selectively activated based on the input.
Batch Normalisation
Architectures: A technique that normalises layer inputs during training to stabilise and accelerate deep neural network learning.
Attention Mechanism
Architectures: A neural network component that learns to focus on relevant parts of the input when producing each element of the output.
Flash Attention
Architectures: An IO-aware attention algorithm that reduces memory reads and writes by tiling the attention computation, enabling faster training of long-context transformer models.
Fully Connected Layer
Architectures: A neural network layer where every neuron is connected to every neuron in the adjacent layers.
Residual Connection
Training & Optimisation: A skip connection that adds a layer's input directly to its output, enabling gradient flow through deep networks and allowing training of architectures with hundreds of layers.
Fine-Tuning
Language Models: The process of adapting a pre-trained model to a specific task by continuing training on a smaller task-specific dataset, transferring learned representations to new domains.