Overview
Training neural networks using both 16-bit and 32-bit floating-point arithmetic to speed up computation while maintaining accuracy.
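A minimal sketch of why the 32-bit half of the mix matters for accuracy. It uses only Python's standard `struct` module to round values to half and single precision. The helper names `to_fp16` and `to_fp32` and the example weight and update values are illustrative assumptions, not taken from any particular framework.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

weight = 1.0
update = 1e-4  # hypothetical value of learning_rate * gradient

# Accumulating in 16-bit: the spacing between fp16 values near 1.0 is 2**-10,
# so an update this small is rounded away and the weight never changes.
w16 = to_fp16(to_fp16(weight) + to_fp16(update))

# Accumulating in 32-bit: the same update is preserved.
w32 = to_fp32(to_fp32(weight) + to_fp32(update))

print("fp16 weight changed:", w16 != weight)
print("fp32 weight changed:", w32 != weight)
```

This is one reason mixed precision schemes commonly keep a 32-bit copy of the weights for the update step while running the heavier arithmetic in 16-bit.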
More in Deep Learning
Long Short-Term Memory
Architectures: A recurrent neural network architecture designed to learn long-term dependencies by using gating mechanisms to control information flow.
Graph Neural Network
Architectures: A neural network designed to operate on graph-structured data, learning representations of nodes, edges, and entire graphs.
Recurrent Neural Network
Architectures: A neural network architecture where connections between nodes form directed cycles, enabling processing of sequential data.
Attention Head
Training & Optimisation: An individual attention computation within a multi-head attention layer that learns to focus on different aspects of the input, with outputs concatenated for richer representations.
Adapter Layers
Language Models: Small trainable modules inserted between frozen transformer layers that enable task-specific adaptation without modifying the original model weights.
Flash Attention
Architectures: An IO-aware attention algorithm that reduces memory reads and writes by tiling the attention computation, enabling faster training of long-context transformer models.
Gated Recurrent Unit
Architectures: A simplified variant of LSTM that combines the forget and input gates into a single update gate.
Weight Decay
Architectures: A regularisation technique that penalises large model weights during training by adding a term proportional to the squared weight magnitude to the loss function, which helps reduce overfitting.