Overview
A technique that caps individual gradient values, or rescales the overall gradient norm, during training to prevent the exploding gradient problem.
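Clipping is typically applied between the backward pass and the optimizer step. The sketch below shows the two common variants in PyTorch, clipping by global norm and element-wise clipping by value; the model, data, and thresholds are illustrative placeholders, not prescribed settings.

```python
import torch
import torch.nn as nn

# Illustrative model, loss, and data; any trainable model would do.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(32, 10)
y = torch.randn(32, 1)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# Clip by global norm: rescale all gradients together so their combined
# L2 norm does not exceed max_norm (1.0 is an arbitrary example value).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Alternative: element-wise clipping, capping each gradient entry to
# the range [-0.5, 0.5] (again an arbitrary example threshold).
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

optimizer.step()
```

Norm-based clipping preserves the gradient's direction while bounding its magnitude, whereas value-based clipping can change the direction; which to use is a modelling choice.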
More in Deep Learning
Contrastive Learning
Architectures: A self-supervised learning approach that trains models by comparing similar and dissimilar pairs of data representations.
Gated Recurrent Unit
Architectures: A simplified variant of LSTM that combines the forget and input gates into a single update gate.
Tensor Parallelism
Architectures: A distributed computing strategy that splits individual layer computations across multiple devices by partitioning weight matrices along specific dimensions.
Word Embedding
Language Models: Dense vector representations of words where semantically similar words are mapped to nearby points in vector space.
Long Short-Term Memory
Architectures: A recurrent neural network architecture designed to learn long-term dependencies by using gating mechanisms to control information flow.
Skip Connection
Architectures: A neural network shortcut that allows the output of one layer to bypass intermediate layers and be added to a later layer's output.
Pre-Training
Language Models: The initial phase of training a deep learning model on a large unlabelled corpus using self-supervised objectives, establishing general-purpose representations for downstream adaptation.
Prefix Tuning
Language Models: A parameter-efficient method that prepends trainable continuous vectors to the input of each transformer layer, guiding model behaviour without altering base parameters.