Overview
A problem in which gradients grow exponentially as they are propagated backwards through the layers of a deep network, causing unstable weight updates and, in severe cases, numerical overflow and training failure.
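A minimal sketch of the failure mode and of its most common remedy, gradient clipping. The framework (PyTorch) and the depth, width, and weight-scaling values are illustrative assumptions rather than details from this entry: scaling each layer's weights pushes the per-layer gain above 1, so the gradient norm grows roughly geometrically with depth until clipping rescales it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A deep stack of purely linear layers (nonlinearities omitted to keep the
# per-layer gain easy to reason about). Scaling the weights pushes each
# layer's gain above 1, so gradients grow roughly geometrically with depth.
depth, width = 30, 64  # illustrative values
net = nn.Sequential(*[nn.Linear(width, width) for _ in range(depth)])
with torch.no_grad():
    for layer in net:
        layer.weight.mul_(3.0)  # assumed scale factor, chosen to force explosion

x = torch.randn(8, width)
loss = net(x).pow(2).mean()
loss.backward()

# clip_grad_norm_ returns the global gradient norm *before* clipping and
# rescales all gradients in place so that the global norm is at most max_norm.
norm_before = nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)
norm_after = torch.norm(torch.stack([p.grad.norm() for p in net.parameters()]))

print(f"gradient norm before clipping: {norm_before:.3e}")  # huge, grows with depth
print(f"gradient norm after clipping:  {norm_after:.3e}")   # at most 1.0
```

In a real training loop the same clipping call sits between loss.backward() and optimizer.step(), bounding the size of each weight update.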
More in Deep Learning
Data Parallelism
Architectures
A distributed training strategy that replicates the model across multiple devices and divides training data into batches processed simultaneously, synchronising gradients after each step.
LoRA
Language Models
Low-Rank Adaptation — a parameter-efficient fine-tuning technique that adds trainable low-rank matrices to frozen pretrained weights.
Sigmoid Function
Training & Optimisation
An activation function that maps input values to a range between 0 and 1, useful for binary classification outputs.
Model Parallelism
Architectures
A distributed training approach that partitions a model across multiple devices, enabling training of models too large to fit in a single accelerator's memory.
Prefix Tuning
Language Models
A parameter-efficient method that prepends trainable continuous vectors to the input of each transformer layer, guiding model behaviour without altering base parameters.
Tensor Parallelism
Architectures
A distributed computing strategy that splits individual layer computations across multiple devices by partitioning weight matrices along specific dimensions.
Fine-Tuning
Architectures
The process of taking a pretrained model and further training it on a smaller, task-specific dataset.
Mixture of Experts
Architectures
An architecture where different specialised sub-networks (experts) are selectively activated based on the input.