Overview
A memory optimisation that trades computation for memory by recomputing intermediate activations during the backward pass instead of storing them all during the forward pass.
More in Deep Learning
Self-Attention
Training & OptimisationAn attention mechanism where each element in a sequence attends to all other elements to compute its representation.
Activation Function
Training & OptimisationA mathematical function applied to neural network outputs to introduce non-linearity, enabling the learning of complex patterns.
Exploding Gradient
ArchitecturesA problem where gradients grow exponentially during backpropagation, causing unstable weight updates and training failure.
Weight Initialisation
ArchitecturesThe strategy for setting initial parameter values in a neural network before training begins.
Word Embedding
Language ModelsDense vector representations of words where semantically similar words are mapped to nearby points in vector space.
Vision Transformer
ArchitecturesA transformer architecture adapted for image recognition that divides images into patches and processes them as sequences, rivalling convolutional networks in visual tasks.
Flash Attention
ArchitecturesAn IO-aware attention algorithm that reduces memory reads and writes by tiling the attention computation, enabling faster training of long-context transformer models.
Attention Head
Training & OptimisationAn individual attention computation within a multi-head attention layer that learns to focus on different aspects of the input, with outputs concatenated for richer representations.