ReLU — Technology Wiki

Overview

Direct Answer

Rectified Linear Unit (ReLU) is an activation function defined by f(x) = max(0, x): positive inputs pass through unchanged, whilst negative inputs are set to zero. Its simplicity and low computational cost have made it one of the most widely used activation functions in deep neural networks.
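As a minimal sketch of the definition above, the following plain-Python snippet applies f(x) = max(0, x) to a list of values. The names relu and relu_layer are illustrative and are not taken from any particular framework.

```python
def relu(x: float) -> float:
    """Return max(0, x): positive inputs pass through, negatives become zero."""
    return max(0.0, x)


def relu_layer(values):
    """Apply ReLU element-wise to a list of pre-activations (illustrative helper)."""
    return [relu(v) for v in values]


activations = relu_layer([-2.0, -0.5, 0.0, 1.5, 3.0])
print(activations)  # [0.0, 0.0, 0.0, 1.5, 3.0]
```

Deep learning frameworks provide vectorised equivalents that operate on whole tensors; the loop here is only meant to make the element-wise behaviour explicit.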

How It Works

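ReLU is applied element-wise to each neuron's pre-activation (the weighted sum of its inputs plus a bias). The result is a piecewise linear function with a kink at zero, which is what introduces non-linearity. During backpropagation, the gradient passes through unchanged wherever the input is positive (gradient = 1) and is blocked wherever the input is negative (gradient = 0); at exactly zero the derivative is undefined, and implementations commonly use 0 there. Because ReLU does not saturate for positive inputs, it is less prone to vanishing gradients than sigmoid or tanh, which often makes training faster.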
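The gradient behaviour described above can be sketched in a few lines of plain Python. The function names are illustrative, and returning 0 for the derivative at exactly x = 0 is an assumed convention rather than a mathematical fact. The sigmoid derivative is included only for comparison.

```python
import math


def relu_grad(x: float) -> float:
    """Derivative of max(0, x): 1 for x > 0, 0 for x < 0.

    The derivative is undefined at x == 0; returning 0 there is an
    assumed convention for this sketch.
    """
    return 1.0 if x > 0 else 0.0


def sigmoid_grad(x: float) -> float:
    """Derivative of the logistic sigmoid, s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


# Compare how much gradient each function lets through at a few inputs.
for x in (-5.0, 0.5, 5.0):
    print(f"x={x:+.1f}  relu'={relu_grad(x):.1f}  sigmoid'={sigmoid_grad(x):.4f}")
```

The sigmoid derivative never exceeds 0.25 and shrinks towards zero for large positive or negative inputs, whereas the ReLU derivative stays at 1 for every positive input.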

Why It Matters

Evaluating ReLU requires only a comparison with zero, with no exponentials or divisions, which keeps its cost low in large networks during both training and inference on GPUs and CPUs. Combined with strong empirical results on image classification, natural language processing, and reinforcement learning tasks, this has made it a common default for practitioners optimising model accuracy and training speed.

Common Applications

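ReLU is the usual hidden-layer activation in convolutional neural networks for computer vision and in feed-forward (fully connected) networks. The original Transformer used it in its feed-forward sublayers, although many later transformer-based language models use GELU or related variants instead. Recurrent architectures such as LSTMs and GRUs rely mainly on sigmoid and tanh gates, so ReLU is less common there. Typical application areas include image recognition and perception systems for autonomous vehicles.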

Key Considerations

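The 'dying ReLU' problem occurs when a neuron's pre-activation is negative for every input, so the neuron always outputs zero. Because the gradient is also zero in that region, the neuron's weights stop updating and it may never recover, reducing the network's effective capacity. Variants such as Leaky ReLU, which keeps a small non-zero slope for negative inputs, and GELU, a smooth alternative, reduce this risk whilst remaining cheap to compute.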
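The following sketch illustrates the dying ReLU problem on a single hypothetical neuron. The weights, bias, and inputs are made-up values, chosen so that the pre-activation is negative for every sample. The slope of 0.01 for Leaky ReLU is a commonly used default, assumed here for illustration.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def relu_grad(x: float) -> float:
    # Convention assumed here: gradient 0 at x == 0.
    return 1.0 if x > 0 else 0.0


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Like ReLU, but negative inputs are scaled instead of zeroed."""
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x: float, negative_slope: float = 0.01) -> float:
    return 1.0 if x > 0 else negative_slope


# Hypothetical neuron whose large negative bias pushes every
# pre-activation below zero for the sample inputs.
weights = [0.5, -0.3]
bias = -10.0
inputs = [[1.0, 2.0], [3.0, -1.0], [0.0, 4.0]]

pre_activations = [
    sum(w * x for w, x in zip(weights, row)) + bias for row in inputs
]

# ReLU: output and gradient are zero everywhere, so no learning signal.
print([relu(p) for p in pre_activations])       # [0.0, 0.0, 0.0]
print([relu_grad(p) for p in pre_activations])  # [0.0, 0.0, 0.0]

# Leaky ReLU: a small gradient survives, so the weights can still move.
print([leaky_relu_grad(p) for p in pre_activations])  # [0.01, 0.01, 0.01]
```

With plain ReLU the neuron contributes nothing to either the forward or the backward pass for these inputs, whereas the leaky variant still passes a small gradient that can move the weights back towards an active region.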

Cross-References

Deep Learning

Activation Function

Related in Training & Optimisation

Self-Attention

An attention mechanism in which each element of a sequence attends to every element of the same sequence, itself included, to compute its representation.

Multi-Head Attention

An attention mechanism that runs multiple attention operations in parallel, capturing different types of relationships.

Residual Network

A deep neural network architecture using skip connections that allow gradients to flow directly through layers, enabling very deep networks.

Layer Normalisation

A normalisation technique that normalises across the features of each individual sample rather than across the batch.

Dropout

A regularisation technique that randomly deactivates neurons during training to prevent co-adaptation and reduce overfitting.

Activation Function

A mathematical function applied to neural network outputs to introduce non-linearity, enabling the learning of complex patterns.

Sigmoid Function

An activation function that maps input values to a range between 0 and 1, useful for binary classification outputs.

Softmax Function

An activation function that converts a vector of numbers into a probability distribution, commonly used in multi-class classification.

Positional Encoding

A technique that injects information about the position of tokens in a sequence into transformer architectures.

Gradient Clipping

A technique that caps gradient values, or rescales the overall gradient norm, during training to mitigate the exploding gradient problem.

Mixed Precision Training

Training neural networks using both 16-bit and 32-bit floating-point arithmetic to speed up computation while maintaining accuracy.

Rotary Positional Encoding

A position encoding method that encodes absolute position with a rotation matrix and naturally incorporates relative position information into attention computations.

More in Deep Learning

Deep Learning

Architectures

A subset of machine learning using neural networks with multiple layers to learn hierarchical representations of data.

Mixture of Experts

Architectures

An architecture where different specialised sub-networks (experts) are selectively activated based on the input.

Batch Normalisation

Architectures

A technique that normalises layer inputs across each mini-batch during training to stabilise and accelerate deep neural network learning.

Attention Mechanism

Architectures

A neural network component that learns to focus on relevant parts of the input when producing each element of the output.

Flash Attention

Architectures

An IO-aware attention algorithm that reduces memory reads and writes by tiling the attention computation, enabling faster training of long-context transformer models.

Fully Connected Layer

Architectures

A neural network layer in which every neuron is connected to every neuron in the preceding layer.

Residual Connection

Training & Optimisation

A skip connection that adds a layer's input directly to its output, enabling gradient flow through deep networks and allowing training of architectures with hundreds of layers.

Fine-Tuning

Language Models

The process of adapting a pre-trained model to a specific task by continuing training on a smaller task-specific dataset, transferring learned representations to new domains.