Layer Normalisation — Technology Wiki

Overview

Direct Answer

Layer Normalisation is a technique that normalises the activations of a neural network by computing statistics (mean and variance) across the feature dimension for each individual sample, independent of batch composition. This differs from batch normalisation, which during training computes statistics for each feature across the samples in a batch, so its output for one sample depends on the other samples in that batch.
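The difference in normalisation axis can be written out explicitly. The notation here is an assumption made for illustration, not taken from this page: x is a batch of N samples with H features each, i indexes samples and j indexes features.

```latex
% Layer normalisation: one mean and variance per sample i, taken over its H features
\mu_i^{\mathrm{LN}} = \frac{1}{H}\sum_{j=1}^{H} x_{ij}, \qquad
\left(\sigma_i^{\mathrm{LN}}\right)^2 = \frac{1}{H}\sum_{j=1}^{H} \left(x_{ij} - \mu_i^{\mathrm{LN}}\right)^2

% Batch normalisation (training): one mean and variance per feature j, taken over the N samples
\mu_j^{\mathrm{BN}} = \frac{1}{N}\sum_{i=1}^{N} x_{ij}, \qquad
\left(\sigma_j^{\mathrm{BN}}\right)^2 = \frac{1}{N}\sum_{i=1}^{N} \left(x_{ij} - \mu_j^{\mathrm{BN}}\right)^2
```

Only the batch-normalisation statistics contain a sum over i, which is why only batch normalisation is affected by what else is in the batch.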

How It Works

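For each sample, the algorithm computes the mean and standard deviation across all features in a given layer, subtracts the mean from each activation, and divides by the standard deviation (with a small constant added to the variance for numerical stability). The normalised activations are then scaled and shifted by learnable per-feature affine parameters (gain and bias). Because the statistics come from a single sample, the result does not depend on batch size or batch composition, and the same computation is used during training and inference. This makes the technique a natural fit for recurrent and transformer architectures, where sequences vary in length and batch statistics are awkward to maintain.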
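The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `layer_norm`, the `eps` default of 1e-5, and the use of the biased (divide-by-n) variance are assumptions chosen to match common practice.

```python
import math

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalise one sample's feature vector, then apply the affine step.

    x, gain and bias are equal-length lists of floats; eps is an assumed
    small constant that keeps the division stable when the variance is ~0.
    """
    n = len(x)
    mean = sum(x) / n
    # Biased (population) variance over the features of this single sample.
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    # Scale and shift each normalised feature with its own gain and bias.
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gain, bias)]

sample = [1.0, 2.0, 3.0, 4.0]
out = layer_norm(sample, gain=[1.0] * 4, bias=[0.0] * 4)
print([round(v, 3) for v in out])
```

With unit gain and zero bias the output has approximately zero mean and unit variance across the features. In a trained model the gain and bias are learned, so the layer can recover a different scale and offset where that helps.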

Why It Matters

Layer normalisation stabilises training in models where batch statistics are unreliable or unavailable, notably recurrent neural networks, sequence-to-sequence models, and transformer-based architectures. It typically speeds up convergence, reduces sensitivity to initialisation, and behaves identically at any batch size, including a batch of one, which matters for small-batch training and for serving single requests in production systems.
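The batch-independence claim can be checked directly. The sketch below normalises the same sample inside two different batches, once with per-sample statistics and once with per-feature batch statistics (batch normalisation in its training-mode form). The helper names, the example values, and the omission of gain and bias are simplifying assumptions for illustration.

```python
import math

EPS = 1e-5  # assumed stabilising constant

def layer_norm_batch(batch):
    """Normalise each row (sample) with its own mean and variance."""
    out = []
    for row in batch:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + EPS) for v in row])
    return out

def batch_norm_batch(batch):
    """Normalise each column (feature) with statistics taken over the batch."""
    n = len(batch)
    cols = list(zip(*batch))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return [
        [(v - m) / math.sqrt(s + EPS) for v, m, s in zip(row, means, variances)]
        for row in batch
    ]

target = [1.0, 2.0, 6.0]
batch_a = [target, [0.0, 0.0, 0.0], [2.0, 4.0, 1.0]]
batch_b = [target, [9.0, -3.0, 5.0]]

# Output for the same sample when it sits in two different batches.
ln_a, ln_b = layer_norm_batch(batch_a)[0], layer_norm_batch(batch_b)[0]
bn_a, bn_b = batch_norm_batch(batch_a)[0], batch_norm_batch(batch_b)[0]

print("layer norm output unchanged:", ln_a == ln_b)
print("batch norm output unchanged:", bn_a == bn_b)
```

The layer-normalised output for `target` is the same in both batches, while the batch-normalised output changes with its neighbours and with the batch size.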

Common Applications

This technique is foundational in transformer models used for natural language processing, machine translation, and large language models. It is also employed in recurrent architectures for time-series forecasting, speech recognition systems, and reinforcement learning agents where batch normalisation is impractical or ineffective.

Key Considerations

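Layer normalisation computes a mean and variance for every sample at both training and inference time, whereas batch normalisation uses fixed statistics at inference that can be folded into the preceding layer, so layer normalisation carries some additional per-sample cost. It may also be less effective than batch normalisation in convolutional and other feedforward networks trained with large batches, where batch statistics are stable. Performance varies with architecture and problem domain, so the choice should be validated empirically during model development.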

Related in Training & Optimisation

Self-Attention

An attention mechanism where each element in a sequence attends to every element of the same sequence, including itself, to compute its representation.

Multi-Head Attention

An attention mechanism that runs multiple attention operations in parallel, capturing different types of relationships.

Residual Network

A deep neural network architecture using skip connections that allow gradients to flow directly through layers, enabling very deep networks.

Dropout

A regularisation technique that randomly deactivates neurons during training to prevent co-adaptation and reduce overfitting.

Activation Function

A mathematical function applied to neural network outputs to introduce non-linearity, enabling the learning of complex patterns.

ReLU

Rectified Linear Unit — an activation function that outputs the input directly if positive, otherwise outputs zero.

Sigmoid Function

An activation function that maps input values to a range between 0 and 1, useful for binary classification outputs.

Softmax Function

An activation function that converts a vector of numbers into a probability distribution, commonly used in multi-class classification.

Positional Encoding

A technique that injects information about the position of tokens in a sequence into transformer architectures.

Gradient Clipping

A technique that caps gradient values during training to prevent the exploding gradient problem.

Mixed Precision Training

Training neural networks using both 16-bit and 32-bit floating-point arithmetic to speed up computation while maintaining accuracy.

Rotary Positional Encoding

A position encoding method that encodes absolute position with a rotation matrix and naturally incorporates relative position information into attention computations.

More in Deep Learning

Autoencoder (Architectures)

A neural network trained to encode input data into a compressed representation and then decode it back to reconstruct the original.

Word Embedding (Language Models)

Dense vector representations of words where semantically similar words are mapped to nearby points in vector space.

Fine-Tuning (Language Models)

The process of adapting a pre-trained model to a specific task by continuing training on a smaller task-specific dataset, transferring learned representations to new domains.

Vanishing Gradient (Architectures)

A problem in deep networks where gradients become extremely small during backpropagation, preventing earlier layers from learning.

Convolutional Neural Network (Architectures)

A deep learning architecture designed for processing structured grid data like images, using convolutional filters to detect features.

Tensor Parallelism (Architectures)

A distributed computing strategy that splits individual layer computations across multiple devices by partitioning weight matrices along specific dimensions.

Deep Learning (Architectures)

A subset of machine learning using neural networks with multiple layers to learn hierarchical representations of data.

Pre-Training (Language Models)

The initial phase of training a deep learning model on a large unlabelled corpus using self-supervised objectives, establishing general-purpose representations for downstream adaptation.