Overview
Rectified Linear Unit — an activation function that outputs its input when the input is positive, and zero otherwise.
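A minimal sketch of that rule in Python (the function name `relu` is our choice, not from the source):

```python
def relu(x):
    """Rectified Linear Unit: return x if x is positive, otherwise 0."""
    return max(0.0, x)

# Applied element-wise to a few sample activations:
print([relu(v) for v in [-2.0, -0.5, 0.0, 1.5, 3.0]])  # [0.0, 0.0, 0.0, 1.5, 3.0]
```

Because the positive branch is the identity, the gradient there is exactly 1, which is what makes ReLU cheap and resistant to vanishing gradients compared with saturating activations.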
More in Deep Learning
Parameter-Efficient Fine-Tuning
Language Models: Methods for adapting large pretrained models to new tasks by updating only a small fraction of their parameters.
Data Parallelism
Architectures: A distributed training strategy that replicates the model across multiple devices, splits each training batch across them for simultaneous processing, and synchronises gradients after each step.
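The synchronised-update step can be sketched without real devices: each replica computes a gradient on its own data shard, the gradients are averaged (standing in for the all-reduce), and one update is applied everywhere. This toy example (our construction, assuming a scalar model y ≈ w·x with squared-error loss) shows the mechanics only:

```python
def grad_shard(w, shard):
    """Mean-squared-error gradient of y ~ w * x over one device's data shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.1):
    """One synchronised step: per-replica gradients, averaged, one update."""
    grads = [grad_shard(w, s) for s in shards]  # simulated replicas
    avg_grad = sum(grads) / len(grads)          # stands in for all-reduce
    return w - lr * avg_grad

# Two "devices", data generated by y = 2x; w moves toward 2 from 0.
w = data_parallel_step(0.0, [[(1.0, 2.0)], [(2.0, 4.0)]])
print(w)  # 1.0
```

Because every replica applies the same averaged gradient, all copies of the model stay identical after each step.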
Batch Normalisation
Architectures: A technique that normalises layer inputs during training to stabilise and accelerate deep neural network learning.
Deep Learning
Architectures: A subset of machine learning using neural networks with multiple layers to learn hierarchical representations of data.
Convolutional Layer
Architectures: A neural network layer that applies learnable filters across input data to detect local patterns and features.
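The "sliding filter" idea is easiest to see in one dimension. This sketch (our own, with valid padding and stride 1) computes a dot product between the filter and each window of the input:

```python
def conv1d(signal, kernel):
    """Slide a filter across the input, taking a dot product at each position
    (valid padding, stride 1)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# An edge-detecting filter [1, 0, -1] responds to local changes in the signal.
print(conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0]))  # [-2.0, -2.0]
```

Real convolutional layers apply many such filters in parallel over 2-D inputs with multiple channels, but each output value is still this same windowed dot product, with the filter weights learned by gradient descent.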
Encoder-Decoder Architecture
Architectures: A neural network design where an encoder processes input into a fixed representation and a decoder generates output from it.
Skip Connection
Architectures: A neural network shortcut that allows the output of one layer to bypass intermediate layers and be added to a later layer's output.
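The additive bypass can be sketched in a few lines. Here `transform` is a hypothetical stand-in for the bypassed layers; the skip connection simply adds the block's input to its output element-wise:

```python
def transform(x):
    """Stand-in for the bypassed layers (hypothetical, for illustration)."""
    return [0.1 * v for v in x]

def residual_block(x):
    """Skip connection: the input bypasses transform() and is added back in."""
    return [a + b for a, b in zip(x, transform(x))]

print(residual_block([1.0, 2.0]))  # ~[1.1, 2.2]
```

Because the block computes x + transform(x), gradients can flow through the identity path even when the transform's gradients are small, which is why skip connections make very deep networks trainable.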
Pretraining
Architectures: Training a model on a large general dataset before fine-tuning it on a specific downstream task.