Mixed Precision Training — Technology Wiki

Overview

Direct Answer

Mixed precision training uses lower-precision (16-bit) floating-point arithmetic for the majority of forward and backward passes, while maintaining higher-precision (32-bit) computations for weight updates and loss scaling. This approach accelerates training without significantly degrading model accuracy, exploiting the numerical tolerance of neural networks to reduced intermediate precision.

How It Works

During each training iteration, activations and gradients are computed in float16, reducing memory bandwidth and enabling faster tensor operations on modern GPUs and TPUs. Critical operations—weight updates, gradient accumulation, and loss scaling—remain in float32 to prevent numerical underflow and maintain convergence stability. Loss scaling temporarily increases loss magnitudes before backpropagation, then scales gradients down before weight updates, preserving gradient information that would otherwise be lost in 16-bit range.

Why It Matters

Organisations adopt this technique to reduce training time by 2–3× and memory consumption by up to 50%, enabling larger batch sizes and faster iteration cycles. Lower memory footprint also permits training of larger models on constrained hardware, directly impacting research velocity and infrastructure costs in computationally intensive domains.

Common Applications

Computer vision models, natural language processing transformers, and recommendation systems routinely employ this technique. Large language model training, image classification pipelines, and object detection frameworks benefit from the speed and memory gains without sacrificing production-grade accuracy.

Key Considerations

Not all models train stably with reduced precision; some loss landscapes exhibit gradient degradation requiring careful hyperparameter tuning. Practitioners must monitor convergence behaviour closely and validate final model accuracy, as numerical precision tradeoffs can manifest unexpectedly in downstream tasks.

Related in Training & Optimisation

Self-Attention

An attention mechanism where each element in a sequence attends to all other elements to compute its representation.

Multi-Head Attention

An attention mechanism that runs multiple attention operations in parallel, capturing different types of relationships.

Residual Network

A deep neural network architecture using skip connections that allow gradients to flow directly through layers, enabling very deep networks.

Layer Normalisation

A normalisation technique that normalises across the features of each individual sample rather than across the batch.

Dropout

A regularisation technique that randomly deactivates neurons during training to prevent co-adaptation and reduce overfitting.

Activation Function

A mathematical function applied to neural network outputs to introduce non-linearity, enabling the learning of complex patterns.

ReLU

Rectified Linear Unit — an activation function that outputs the input directly if positive, otherwise outputs zero.

Sigmoid Function

An activation function that maps input values to a range between 0 and 1, useful for binary classification outputs.

Softmax Function

An activation function that converts a vector of numbers into a probability distribution, commonly used in multi-class classification.

Positional Encoding

A technique that injects information about the position of tokens in a sequence into transformer architectures.

Gradient Clipping

A technique that caps gradient values during training to prevent the exploding gradient problem.

Rotary Positional Encoding

A position encoding method that encodes absolute position with a rotation matrix and naturally incorporates relative position information into attention computations.

More in Deep Learning

Attention Mechanism

Architectures

A neural network component that learns to focus on relevant parts of the input when producing each element of the output.

Deep Learning

Architectures

A subset of machine learning using neural networks with multiple layers to learn hierarchical representations of data.

Fully Connected Layer

Architectures

A neural network layer where every neuron is connected to every neuron in the adjacent layers.

Embedding

Architectures

A learned dense vector representation of discrete data (like words or categories) in a continuous vector space.

Exploding Gradient

Architectures

A problem where gradients grow exponentially during backpropagation, causing unstable weight updates and training failure.

Word Embedding

Language Models

Dense vector representations of words where semantically similar words are mapped to nearby points in vector space.

Capsule Network

Architectures

A neural network architecture that groups neurons into capsules to better capture spatial hierarchies and part-whole relationships.

Key-Value Cache

Architectures

An optimisation in autoregressive transformer inference that stores previously computed key and value tensors to avoid redundant computation during sequential token generation.