Batch Normalisation — Technology Wiki

Overview

Direct Answer

Batch normalisation is a technique that rescales and recentres the inputs to each layer during neural network training by normalising activations across a mini-batch. It was originally proposed to reduce internal covariate shift (the change in the distribution of layer inputs as the parameters of earlier layers update during training), and in practice it enables faster convergence and more stable training.

How It Works

For each mini-batch, the technique computes the mean and variance of each feature's activations across the training samples in the batch, then standardises the activations by subtracting the mean and dividing by the standard deviation (z-score normalisation), with a small constant added to the variance for numerical stability. Learnable scale and shift parameters (gamma and beta) are then applied per feature, so the network can recover any expressivity lost to normalisation. During inference, running estimates of the population mean and variance, accumulated over training batches, replace the mini-batch statistics.
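The steps described above can be sketched in a few lines of pure Python. This is an illustrative sketch rather than a reference implementation: the class name `BatchNorm1D`, the `momentum` and `eps` parameters and their default values are assumptions chosen for the example, the running statistics are updated with a simple exponential moving average, and gamma and beta are held at their initial values because no training loop is shown.

```python
import math


class BatchNorm1D:
    """Illustrative batch normalisation over a list of feature vectors."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        # gamma (scale) and beta (shift) would normally be learned by gradient
        # descent; here they stay at their identity initialisation.
        self.gamma = [1.0] * num_features
        self.beta = [0.0] * num_features
        # Running estimates of the population statistics, used at inference.
        self.running_mean = [0.0] * num_features
        self.running_var = [1.0] * num_features
        self.momentum = momentum  # assumed update rate for the moving average
        self.eps = eps            # assumed small constant for numerical stability

    def forward(self, batch, training=True):
        n = len(batch)
        d = len(self.gamma)
        out = [[0.0] * d for _ in range(n)]
        for j in range(d):
            if training:
                # Mini-batch statistics for feature j (biased variance).
                column = [row[j] for row in batch]
                mean = sum(column) / n
                var = sum((x - mean) ** 2 for x in column) / n
                # Fold the batch statistics into the running estimates.
                m = self.momentum
                self.running_mean[j] = (1 - m) * self.running_mean[j] + m * mean
                self.running_var[j] = (1 - m) * self.running_var[j] + m * var
            else:
                # Inference: use the stored estimates, not the current batch.
                mean, var = self.running_mean[j], self.running_var[j]
            inv_std = 1.0 / math.sqrt(var + self.eps)
            for i in range(n):
                normalised = (batch[i][j] - mean) * inv_std
                out[i][j] = self.gamma[j] * normalised + self.beta[j]
        return out


bn = BatchNorm1D(num_features=2)
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
train_out = bn.forward(batch, training=True)            # uses batch statistics
eval_out = bn.forward([[2.0, 20.0]], training=False)    # uses running statistics
```

In training mode each output column has approximately zero mean and unit variance. In inference mode a single sample can be processed on its own, because the statistics come from the stored running estimates rather than from the batch.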

Why It Matters

Batch normalisation typically accelerates training convergence, reduces sensitivity to weight initialisation, and permits higher learning rates, which can shorten training time and lower computational cost. Organisations deploying deep learning systems can benefit from more stable training and, in many cases, better generalisation, particularly when training on large datasets.

Common Applications

The technique is standard in convolutional neural networks for image classification, object detection, and other computer vision pipelines. It is less common in recurrent architectures and transformer-based language models, such as those used in NLP and recommendation systems, which typically rely on layer normalisation instead because its statistics do not depend on the batch.

Key Considerations

Batch normalisation introduces a dependency on batch size: very small batches produce noisy, unreliable statistics, whilst very large batches raise the memory cost of each training step. The layer also behaves differently in training and inference, so the mode must be switched correctly and the running statistics stored with the model. Layer normalisation or group normalisation may be preferable in contexts such as recurrent networks, variable-batch settings, or training with small batches.
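The batch-size dependency can be made concrete with a small comparison. The sketch below is only illustrative: the function names `batch_norm` and `layer_norm`, the `eps` value, and the sample data are assumptions for the example, and the learnable gamma and beta are omitted to keep the contrast visible. It normalises the same sample inside two different batches and also shows what happens with a batch of one.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalise each feature across the samples in the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / math.sqrt(var + eps)
    return out


def layer_norm(batch, eps=1e-5):
    """Normalise each sample across its own features."""
    out = []
    for row in batch:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        out.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return out


sample = [1.0, 2.0, 6.0]
batch_a = [sample, [0.0, 0.0, 0.0]]
batch_b = [sample, [9.0, -4.0, 2.0]]

# Layer normalisation: the result for `sample` ignores its batch companions.
ln_same = layer_norm(batch_a)[0] == layer_norm(batch_b)[0]   # True

# Batch normalisation: the result for `sample` changes with the batch.
bn_same = batch_norm(batch_a)[0] == batch_norm(batch_b)[0]   # False

# Degenerate case: with a batch of one, every feature equals its own mean,
# so training-mode batch normalisation maps the sample to all zeros.
single = batch_norm([sample])[0]                             # [0.0, 0.0, 0.0]
```

The batch-of-one case is the extreme form of the small-batch problem: the batch statistics carry no information about the population, which is one reason per-sample schemes such as layer normalisation are used when batch sizes are small or variable.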

Cross-References

Deep Learning

Neural Network

Related in Architectures

Deep Learning

A subset of machine learning using neural networks with multiple layers to learn hierarchical representations of data.

Neural Network

A computing system inspired by biological neural networks, consisting of interconnected nodes that process information in layers.

Convolutional Neural Network

A deep learning architecture designed for processing structured grid data like images, using convolutional filters to detect features.

Recurrent Neural Network

A neural network architecture where connections between nodes form directed cycles, enabling processing of sequential data.

Long Short-Term Memory

A recurrent neural network architecture designed to learn long-term dependencies by using gating mechanisms to control information flow.

Gated Recurrent Unit

A simplified variant of LSTM that combines the forget and input gates into a single update gate.

Transformer

A neural network architecture based entirely on attention mechanisms, eliminating recurrence and enabling parallel processing of sequences.

Attention Mechanism

A neural network component that learns to focus on relevant parts of the input when producing each element of the output.

Encoder-Decoder Architecture

A neural network design where an encoder processes input into a fixed representation and a decoder generates output from it.

Autoencoder

A neural network trained to encode input data into a compressed representation and then decode it back to reconstruct the original.

Variational Autoencoder

A generative model that learns a probabilistic latent space representation, enabling generation of new data samples.

Embedding

A learned dense vector representation of discrete data (like words or categories) in a continuous vector space.

More in Deep Learning

Skip Connection

Architectures

A neural network shortcut that allows the output of one layer to bypass intermediate layers and be added to a later layer's output.

Vanishing Gradient

Architectures

A problem in deep networks where gradients become extremely small during backpropagation, preventing earlier layers from learning.

Adapter Layers

Language Models

Small trainable modules inserted between frozen transformer layers that enable task-specific adaptation without modifying the original model weights.

Tensor Parallelism

Architectures

A distributed computing strategy that splits individual layer computations across multiple devices by partitioning weight matrices along specific dimensions.

Rotary Positional Encoding

Training & Optimisation

A position encoding method that encodes absolute position with a rotation matrix and naturally incorporates relative position information into attention computations.

Layer Normalisation

Training & Optimisation

A normalisation technique that normalises across the features of each individual sample rather than across the batch.

Residual Network

Training & Optimisation

A deep neural network architecture using skip connections that allow gradients to flow directly through layers, enabling very deep networks.

Mixed Precision Training

Training & Optimisation

Training neural networks using both 16-bit and 32-bit floating-point arithmetic to speed up computation while maintaining accuracy.