Overview
Direct Answer
Vanishing gradient is a training pathology in deep neural networks in which gradients computed during backpropagation shrink roughly exponentially as they propagate backwards through the layers, approaching zero and effectively halting weight updates in the earliest layers. This prevents the layers closest to the input from learning meaningful representations and is particularly acute in recurrent and very deep feedforward architectures.
How It Works
During backpropagation, the chain rule multiplies local derivatives together layer by layer. Saturating activation functions such as sigmoid and tanh squash their outputs into a narrow range and have small derivatives (at most 0.25 for sigmoid, and at most 1 for tanh, falling towards zero as a unit saturates), so successive multiplications produce increasingly tiny values unless the weights are large enough to compensate. In recurrent networks, the same weight matrix is applied at every time step, which compounds the attenuation: the gradient signal from distant time steps fades, and long-range dependencies become difficult to learn.
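The chain-rule multiplication described above can be illustrated with a minimal sketch. It assumes a toy network that is not part of this entry: a chain of scalar sigmoid layers, each with a single weight, with the weights all set to 1.0 for simplicity. The function name `weight_gradients` and the input value are hypothetical choices made for illustration only.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def weight_gradients(x: float, weights: list[float]) -> list[float]:
    """Gradient of the final output with respect to each layer's weight.

    Toy model (an assumption for illustration): a_k = sigmoid(w_k * a_{k-1}),
    with a_0 = x, so every layer is a single scalar unit.
    """
    # Forward pass: store each layer's input and its local sigmoid derivative.
    activations = [x]
    local_derivs = []
    for w in weights:
        a = sigmoid(w * activations[-1])
        activations.append(a)
        local_derivs.append(a * (1.0 - a))  # sigmoid'(z) = a * (1 - a) <= 0.25

    # Backward pass: apply the chain rule from the last layer to the first.
    grads = [0.0] * len(weights)
    upstream = 1.0  # d(output) / d(output)
    for k in reversed(range(len(weights))):
        dz = upstream * local_derivs[k]   # d(output) / d(pre-activation of layer k)
        grads[k] = dz * activations[k]    # d(output) / d(w_k)
        upstream = dz * weights[k]        # d(output) / d(input of layer k)
    return grads


if __name__ == "__main__":
    grads = weight_gradients(0.5, [1.0] * 10)
    for depth, g in enumerate(grads, start=1):
        print(f"layer {depth:2d}: d(output)/d(weight) = {g:.3e}")
```

Because each backward step multiplies by a sigmoid derivative of at most 0.25, the gradient reaching the first layer in this sketch is several orders of magnitude smaller than the gradient at the last layer.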
Why It Matters
Training convergence becomes prohibitively slow or stalls entirely, increasing computational cost and training time without improving accuracy. This limits the feasibility of training deeper architectures that could capture more complex patterns, constraining model capacity and performance on tasks that require hierarchical feature learning.
Common Applications
Deep convolutional networks for image recognition, recurrent networks for sequence modelling in natural language processing and time-series forecasting, and encoder-decoder architectures for machine translation and speech recognition suffer most acutely from this problem.
Key Considerations
Modern mitigation techniques, including ReLU activation functions, batch normalisation, and residual connections, have substantially reduced its prevalence. Gradient clipping is often listed alongside them, but it addresses the related exploding-gradient problem rather than vanishing gradients. The underlying issue remains relevant for architecture design and hyperparameter selection in very deep models.
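To see why residual connections help, the following minimal sketch compares the gradient of the output with respect to the input for a plain chain of scalar sigmoid layers and for the same chain with an identity skip connection added to each layer. The scalar toy model, the unit weights, and the function name `input_gradient` are assumptions made for illustration and do not describe any particular library or architecture.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def input_gradient(x: float, weights: list[float], residual: bool) -> float:
    """d(output)/d(input) for a chain of scalar sigmoid layers.

    Plain layer:    a_next = sigmoid(w * a)
    Residual layer: a_next = a + sigmoid(w * a)   (identity skip connection)
    """
    a = x
    grad = 1.0  # running product of per-layer derivatives (chain rule)
    for w in weights:
        s = sigmoid(w * a)
        local = w * s * (1.0 - s)  # derivative of sigmoid(w * a) with respect to a
        if residual:
            a = a + s
            grad *= 1.0 + local    # the skip path contributes the constant 1
        else:
            a = s
            grad *= local
    return grad


if __name__ == "__main__":
    weights = [1.0] * 10
    print("plain chain:   ", input_gradient(0.5, weights, residual=False))
    print("residual chain:", input_gradient(0.5, weights, residual=True))
```

In the plain chain every factor is at most 0.25, so the product collapses towards zero. In the residual chain each factor is 1 plus the local derivative, so with the positive weights assumed here the gradient cannot fall below 1 regardless of depth.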
Cross-References
More in Deep Learning
Attention Head
Training & Optimisation: An individual attention computation within a multi-head attention layer that learns to focus on different aspects of the input, with outputs concatenated for richer representations.
Mamba Architecture
Architectures: A selective state space model that achieves transformer-level performance with linear-time complexity by incorporating input-dependent selection mechanisms into the recurrence.
Self-Attention
Training & Optimisation: An attention mechanism where each element in a sequence attends to all other elements to compute its representation.
Tensor Parallelism
Architectures: A distributed computing strategy that splits individual layer computations across multiple devices by partitioning weight matrices along specific dimensions.
Gradient Checkpointing
Architectures: A memory optimisation that trades computation for memory by recomputing intermediate activations during the backward pass instead of storing them all during the forward pass.
Positional Encoding
Training & Optimisation: A technique that injects information about the position of tokens in a sequence into transformer architectures.
Pretraining
Architectures: Training a model on a large general dataset before fine-tuning it on a specific downstream task.
Mixed Precision Training
Training & Optimisation: Training neural networks using both 16-bit and 32-bit floating-point arithmetic to speed up computation while maintaining accuracy.