Vanishing Gradient

Overview

Direct Answer

Vanishing gradient is a training pathology in deep neural networks where gradients computed during backpropagation shrink roughly exponentially as they propagate backwards through layers, approaching zero and effectively halting weight updates in the earlier layers (those closest to the input). This prevents those early layers from learning meaningful representations and is particularly acute in recurrent and very deep feedforward architectures.

How It Works

During backpropagation, the chain rule multiplies one local derivative per layer, so the gradient reaching an early layer is a product of many factors. When activation functions like sigmoid or tanh saturate, their derivatives are small (the sigmoid derivative never exceeds 0.25), and successive multiplications produce increasingly tiny values. In recurrent networks, the same weight matrix is applied repeatedly across time steps, compounding this attenuation. Because the weights are shared across time, the effect is that the gradient contribution from distant time steps becomes negligible, and the network struggles to learn long-range dependencies.
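The multiplication described above can be sketched numerically. The following is a minimal illustration, not a real training loop: it assumes a chain of scalar sigmoid layers that all share one weight and one pre-activation value, and the names `backprop_gradient_magnitudes`, `weight`, and `pre_activation` are hypothetical, chosen only for this example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    # Derivative of the sigmoid; its maximum is 0.25, reached at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)


def backprop_gradient_magnitudes(depth, weight=1.0, pre_activation=0.0):
    """Gradient magnitude after passing back through 1..depth layers.

    Assumption for illustration: every layer is a scalar sigmoid unit
    with the same weight and the same pre-activation, and the gradient
    arriving at the output is 1.0.
    """
    grad = 1.0
    magnitudes = []
    for _ in range(depth):
        # Chain rule: one local derivative (weight * sigmoid') per layer.
        grad *= weight * sigmoid_grad(pre_activation)
        magnitudes.append(abs(grad))
    return magnitudes


mags = backprop_gradient_magnitudes(depth=10)
for layers_back, g in enumerate(mags, start=1):
    print(f"{layers_back:2d} layers back: {g:.3e}")
```

Even in this best case for the sigmoid (pre-activation 0, where its derivative peaks at 0.25), ten layers reduce the gradient to 0.25 to the power 10, which is below one millionth of its starting value. Saturated units would shrink it faster still.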

Why It Matters

Training convergence becomes prohibitively slow or stalls entirely, increasing computational cost and training time without improving accuracy. This directly limits the feasibility of training deeper architectures that could capture more complex patterns, constraining model capacity and performance on tasks that require hierarchical feature learning.

Common Applications

The problem is most acute in deep convolutional networks for image recognition, recurrent networks for sequence modelling in natural language processing and time-series forecasting, and encoder-decoder architectures for machine translation and speech recognition.

Key Considerations

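Modern mitigation techniques, including ReLU activation functions, batch normalisation, residual connections, careful weight initialisation, and gated recurrent architectures such as LSTM and GRU, have substantially reduced its prevalence, though the underlying issue remains relevant for architecture design and hyperparameter selection in very deep models. Gradient clipping is often mentioned alongside these, but it addresses the opposite problem of exploding gradients and does not restore gradients that have already vanished.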
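To make the effect of two of these mitigations concrete, the sketch below compares the per-layer gradient factor for a sigmoid layer, an active ReLU unit, and a sigmoid layer wrapped in a residual connection. It is a toy scalar model under stated assumptions (identical layers, fixed pre-activations, weight 1.0), and all names and constants in it are hypothetical choices made for illustration.

```python
import math


def sigmoid_grad(z: float) -> float:
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z: float) -> float:
    # 1 for an active unit, 0 for an inactive one.
    return 1.0 if z > 0 else 0.0


def gradient_after(depth: int, per_layer_factor: float) -> float:
    # Product of `depth` identical local derivatives (toy assumption).
    return abs(per_layer_factor) ** depth


DEPTH = 20       # number of layers the gradient passes back through
W = 1.0          # shared scalar weight (assumed)
Z_SIGMOID = 0.0  # pre-activation where the sigmoid derivative peaks
Z_RELU = 1.0     # any positive value keeps the ReLU unit active

factors = {
    # Plain layer y = sigmoid(w * x): local derivative w * sigmoid'(z).
    "sigmoid": W * sigmoid_grad(Z_SIGMOID),
    # Plain layer y = relu(w * x): local derivative w * 1 when active.
    "relu (active unit)": W * relu_grad(Z_RELU),
    # Residual layer y = x + sigmoid(w * x): the identity path adds 1.
    "sigmoid + residual": 1.0 + W * sigmoid_grad(Z_SIGMOID),
}

results = {name: gradient_after(DEPTH, f) for name, f in factors.items()}
for name, g in results.items():
    print(f"{name:20s} {g:.3e}")
```

In this toy setting the plain sigmoid chain collapses towards zero, the active ReLU chain passes the gradient through unchanged, and the residual chain never drops below its starting value because the identity path contributes a factor of 1 at every layer. Real networks are less tidy: an inactive ReLU unit has zero derivative, and weights other than 1.0 rescale every factor, which is one reason normalisation and initialisation are used alongside these techniques.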

Cross-References

Machine Learning

Backpropagation

Related in Architectures

Deep Learning

A subset of machine learning using neural networks with multiple layers to learn hierarchical representations of data.

Neural Network

A computing system inspired by biological neural networks, consisting of interconnected nodes that process information in layers.

Convolutional Neural Network

A deep learning architecture designed for processing structured grid data like images, using convolutional filters to detect features.

Recurrent Neural Network

A neural network architecture where connections between nodes form directed cycles, enabling processing of sequential data.

Long Short-Term Memory

A recurrent neural network architecture designed to learn long-term dependencies by using gating mechanisms to control information flow.

Gated Recurrent Unit

A simplified variant of LSTM that combines the forget and input gates into a single update gate.

Transformer

A neural network architecture based entirely on attention mechanisms, eliminating recurrence and enabling parallel processing of sequences.

Attention Mechanism

A neural network component that learns to focus on relevant parts of the input when producing each element of the output.

Encoder-Decoder Architecture

A neural network design where an encoder processes input into a fixed representation and a decoder generates output from it.

Autoencoder

A neural network trained to encode input data into a compressed representation and then decode it back to reconstruct the original.

Variational Autoencoder

A generative model that learns a probabilistic latent space representation, enabling generation of new data samples.

Batch Normalisation

A technique that normalises layer inputs during training to stabilise and accelerate deep neural network learning.

More in Deep Learning

Attention Head

Training & Optimisation

An individual attention computation within a multi-head attention layer that learns to focus on different aspects of the input, with outputs concatenated for richer representations.

Mamba Architecture

Architectures

A selective state space model that achieves transformer-level performance with linear-time complexity by incorporating input-dependent selection mechanisms into the recurrence.

Self-Attention

Training & Optimisation

An attention mechanism where each element in a sequence attends to all other elements to compute its representation.

Tensor Parallelism

Architectures

A distributed computing strategy that splits individual layer computations across multiple devices by partitioning weight matrices along specific dimensions.

Gradient Checkpointing

Architectures

A memory optimisation that trades computation for memory by recomputing intermediate activations during the backward pass instead of storing them all during the forward pass.

Positional Encoding

Training & Optimisation

A technique that injects information about the position of tokens in a sequence into transformer architectures.

Pretraining

Architectures

Training a model on a large general dataset before fine-tuning it on a specific downstream task.

Mixed Precision Training

Training & Optimisation

Training neural networks using both 16-bit and 32-bit floating-point arithmetic to speed up computation while maintaining accuracy.

See Also

Backpropagation