Overview
Direct Answer
An exploding gradient is a numerical instability during backpropagation in which gradients grow multiplicatively across layers or timesteps, reaching values so large that weight updates overshoot good parameter values and destabilise training. It is the opposite failure mode to the vanishing gradient, in which gradients shrink towards zero, and it occurs most frequently in recurrent neural networks and very deep feedforward architectures.
How It Works
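During backpropagation, gradients are computed via the chain rule by multiplying the Jacobians of successive layers. When these Jacobians, which combine the weight matrices with the activation function derivatives, tend to have norms greater than one (for example, when the largest singular values of the weight matrices exceed one and the activations are not saturated), successive multiplications make the gradient magnitude grow roughly exponentially with depth. In recurrent networks the same weight matrix is applied at every timestep, so unrolling across many timesteps amplifies this effect. The resulting weight updates can overflow to NaN or Inf, rendering the model untrainable within a few iterations.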
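The multiplicative growth described above can be illustrated with a small NumPy sketch. It is a simplified illustration, not a real training loop: it assumes a purely linear recurrence with no activation function, and the helper name `backprop_norms` and its parameters are hypothetical. The weight matrix is a scaled orthogonal matrix, so every singular value equals `scale` and the gradient norm after k backward steps is close to `scale ** k`.

```python
import numpy as np

def backprop_norms(scale, steps=50, size=16, seed=0):
    """Return the gradient norm after each backward step through a linear recurrence.

    Hypothetical helper for illustration only: the layer Jacobian is a fixed
    matrix whose singular values all equal `scale`.
    """
    rng = np.random.default_rng(seed)
    # QR decomposition of a random matrix yields an orthogonal factor q,
    # whose singular values are all 1; rescaling sets them all to `scale`.
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))
    w = scale * q
    grad = np.ones(size) / np.sqrt(size)  # unit-norm starting gradient
    norms = []
    for _ in range(steps):
        grad = w.T @ grad  # chain rule: multiply by the (transposed) Jacobian
        norms.append(float(np.linalg.norm(grad)))
    return norms

growing = backprop_norms(scale=1.5)    # singular values above one: norm explodes
shrinking = backprop_norms(scale=0.5)  # singular values below one: norm vanishes
print(f"scale 1.5, step 50: {growing[-1]:.3e}")
print(f"scale 0.5, step 50: {shrinking[-1]:.3e}")
```

In a real network the Jacobians differ between layers and depend on the activations, so the growth is less regular than in this sketch, but the same compounding applies.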
Why It Matters
Training instability directly increases computational cost through failed training runs and necessitates careful hyperparameter selection. In production pipelines, unstable training reduces model reliability and increases time-to-deployment for sequential architectures used in natural language processing and time-series forecasting, where recurrence is fundamental.
Common Applications
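The problem is most prevalent in plain recurrent networks trained on long sequences and in multi-layer perceptrons exceeding roughly 10–20 layers. It can also occur in long short-term memory networks and gated recurrent units, whose gating primarily addresses vanishing rather than exploding gradients. Affected applications include machine translation, speech recognition, and financial forecasting, where temporal dependencies call for deep or recurrent architectures.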
Key Considerations
Gradient clipping and normalisation techniques mitigate the issue but add hyperparameters to tune, such as the clipping threshold. The severity also depends on the weight initialisation strategy and the choice of activation function, so practitioners must balance architectural expressiveness against training stability.
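As a rough illustration of gradient clipping, the sketch below rescales a set of gradient arrays so that their combined L2 norm does not exceed a threshold. This is one common variant (clipping by global norm), written in plain NumPy; the function name `clip_by_global_norm` and the threshold value are assumptions chosen for the example, not taken from any particular library.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient arrays so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    Hypothetical helper for illustration only.
    """
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total_norm <= max_norm:
        # Already within the threshold: return copies unchanged.
        return [g.copy() for g in grads], total_norm
    factor = max_norm / total_norm
    # Every array is scaled by the same factor, so the update direction is kept.
    return [g * factor for g in grads], total_norm

# Two example gradient arrays whose combined norm is sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
print(norm_before, norm_after)
```

Because all arrays share one scaling factor, clipping changes only the size of the step, not its direction. The threshold `max_norm` is the extra hyperparameter that has to be tuned.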
More in Deep Learning
Rotary Positional Encoding
Training & Optimisation: A position encoding method that encodes absolute position with a rotation matrix and naturally incorporates relative position information into attention computations.
Diffusion Model
Generative Models: A generative model that learns to reverse a gradual noising process, generating high-quality samples from random noise.
Activation Function
Training & Optimisation: A mathematical function applied to neural network outputs to introduce non-linearity, enabling the learning of complex patterns.
Mixture of Experts
Architectures: An architecture where different specialised sub-networks (experts) are selectively activated based on the input.
State Space Model
Architectures: A sequence modelling architecture based on continuous-time dynamical systems that processes long sequences with linear complexity, offering an alternative to attention-based transformers.
ReLU
Training & Optimisation: Rectified Linear Unit, an activation function that outputs the input directly if positive, otherwise outputs zero.
Positional Encoding
Training & Optimisation: A technique that injects information about the position of tokens in a sequence into transformer architectures.
Data Parallelism
Architectures: A distributed training strategy that replicates the model across multiple devices and divides training data into batches processed simultaneously, synchronising gradients after each step.