Overview
Direct Answer
Weight initialisation is the process of assigning initial numerical values to the learnable parameters of a neural network prior to training. The choice of initialisation strategy directly influences convergence speed, final model performance, and the probability of reaching poor local minima.
How It Works
Initialisation schemes draw parameter values from statistical distributions whose scale is matched to the network architecture. Xavier (Glorot) initialisation sets the weight variance from the number of inputs and outputs of each layer (fan-in and fan-out), typically Var(W) = 2 / (fan_in + fan_out), and suits tanh and sigmoid activations. He initialisation uses Var(W) = 2 / fan_in to compensate for ReLU zeroing roughly half of its inputs. Both aim to keep the variance of activations and gradients roughly constant from layer to layer, so that signals neither shrink towards zero nor grow without bound during the forward and backward passes.
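The two schemes named above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names `xavier_uniform` and `he_normal`, the layer sizes, and the seed are assumptions chosen for the example.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier (Glorot) uniform: U(-a, a) with a = sqrt(6 / (fan_in + fan_out)).

    A uniform distribution on (-a, a) has variance a^2 / 3,
    which gives Var(W) = 2 / (fan_in + fan_out).
    """
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    """He normal: N(0, 2 / fan_in), commonly paired with ReLU layers."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Illustrative layer sizes (assumed for this example).
fan_in, fan_out = 512, 256
rng = np.random.default_rng(0)

w_xavier = xavier_uniform(fan_in, fan_out, rng)
w_he = he_normal(fan_in, fan_out, rng)

# Compare the empirical variance with the target variance of each scheme.
print("Xavier:", w_xavier.var(), "target:", 2.0 / (fan_in + fan_out))
print("He:    ", w_he.var(), "target:", 2.0 / fan_in)
```

The empirical variances should sit close to their targets; the match tightens as the weight matrix grows.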
Why It Matters
Poor initialisation can cause training to stall, diverge, or converge slowly, increasing computational cost and time-to-deployment. Appropriate initialisation reduces the risk of vanishing or exploding gradients, enabling faster and more stable convergence and, in many cases, better generalisation. These are critical factors in resource-constrained production environments.
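The effect of weight scale on signal propagation can be seen in a small experiment. The sketch below pushes random inputs through a stack of ReLU layers and reports the spread of the final activations for three weight scales. The helper `final_activation_std`, the depth, the width, and the batch size are all assumptions made for illustration; it covers only the forward pass, not a full training run.

```python
import numpy as np

def final_activation_std(weight_std, depth=30, width=256, seed=0):
    """Standard deviation of activations after `depth` ReLU layers.

    Each layer uses weights drawn from N(0, weight_std^2) and no bias.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(128, width))  # batch of random inputs
    for _ in range(depth):
        w = rng.normal(0.0, weight_std, size=(width, width))
        x = np.maximum(x @ w, 0.0)  # linear layer followed by ReLU
    return x.std()

width = 256
too_small = final_activation_std(0.01)                 # signal shrinks layer by layer
too_large = final_activation_std(1.0)                  # signal grows layer by layer
he_scaled = final_activation_std(np.sqrt(2.0 / width)) # He scaling

print("std = 0.01       ->", too_small)
print("std = 1.0        ->", too_large)
print("std = sqrt(2/n)  ->", he_scaled)
```

With the small scale the activations collapse towards zero, with the large scale they grow by many orders of magnitude, and with He scaling they stay within a moderate range. Gradients passing backwards through the same weights are affected in a similar way.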
Common Applications
Weight initialisation is applied across convolutional neural networks for image classification, recurrent networks for sequential data processing, and transformer models for natural language understanding. Medical imaging, autonomous systems, and recommendation engines all depend on effective initialisation to achieve reliable performance.
Key Considerations
The best initialisation strategy depends on the activation function, network depth, and architecture type; no single scheme works well everywhere. Transfer learning and pre-trained models sidestep most initialisation challenges, but they introduce a dependency on how similar the source domain is to the target task.
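One way to make the dependence on activation function explicit is a small lookup that returns the weight standard deviation for a layer. The sketch below encodes only the common rule-of-thumb pairings mentioned in this entry (He for ReLU, Xavier for tanh and sigmoid). The function `init_std` is hypothetical and does not mirror any particular library's API.

```python
import math

def init_std(activation, fan_in, fan_out):
    """Return a weight standard deviation for the given activation.

    Hypothetical helper: encodes common pairings only, not an
    exhaustive or authoritative mapping.
    """
    if activation == "relu":
        # He initialisation: Var(W) = 2 / fan_in
        return math.sqrt(2.0 / fan_in)
    if activation in ("tanh", "sigmoid"):
        # Xavier (Glorot) initialisation: Var(W) = 2 / (fan_in + fan_out)
        return math.sqrt(2.0 / (fan_in + fan_out))
    raise ValueError(f"no default initialisation rule for {activation!r}")

relu_std = init_std("relu", 512, 256)
tanh_std = init_std("tanh", 512, 256)
print(relu_std, tanh_std)
```

Raising an error for unknown activations, rather than falling back silently, reflects the point that no single scheme is a safe default.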
Cross-References
More in Deep Learning
Graph Neural Network
Architectures: A neural network designed to operate on graph-structured data, learning representations of nodes, edges, and entire graphs.
Capsule Network
Architectures: A neural network architecture that groups neurons into capsules to better capture spatial hierarchies and part-whole relationships.
Layer Normalisation
Training & Optimisation: A normalisation technique that normalises across the features of each individual sample rather than across the batch.
Pretraining
Architectures: Training a model on a large general dataset before fine-tuning it on a specific downstream task.
Representation Learning
Architectures: The automatic discovery of data representations needed for feature detection or classification from raw data.
Embedding
Architectures: A learned dense vector representation of discrete data (like words or categories) in a continuous vector space.
Contrastive Learning
Architectures: A self-supervised learning approach that trains models by comparing similar and dissimilar pairs of data representations.
Pre-Training
Language Models: The initial phase of training a deep learning model on a large unlabelled corpus using self-supervised objectives, establishing general-purpose representations for downstream adaptation.