Overview
A form of model parallelism that splits a neural network's layers across devices and pipelines micro-batches through the resulting stages, keeping multiple devices busy at once and improving hardware utilisation during training.
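A minimal sketch of the idea, in plain single-process Python rather than a real distributed setup: a batch is split into micro-batches, and at each clock tick every stage works on a different micro-batch, so the devices (here, toy stage functions) stay busy once the pipeline has filled. The stage functions, tick loop, and values are illustrative assumptions, not part of the original entry.

```python
from typing import Callable, List


def pipeline_forward(stages: List[Callable], micro_batches: List) -> List:
    """Run micro-batches through the stages in a pipelined (GPipe-style) order.

    At clock tick t, stage s works on micro-batch (t - s), so different stages
    process different micro-batches at the same tick once the pipeline is full.
    """
    num_stages = len(stages)
    num_micro = len(micro_batches)
    # activations[s][m] holds the output of stage s for micro-batch m
    activations = [[None] * num_micro for _ in range(num_stages)]

    for t in range(num_stages + num_micro - 1):   # total clock ticks
        for s in range(num_stages):
            m = t - s                             # micro-batch this stage handles now
            if 0 <= m < num_micro:
                inp = micro_batches[m] if s == 0 else activations[s - 1][m]
                activations[s][m] = stages[s](inp)
                print(f"tick {t}: stage {s} processes micro-batch {m}")
    return activations[-1]


if __name__ == "__main__":
    # Toy 'stages': each just adds a constant, standing in for a block of layers
    # placed on its own device in a real implementation.
    stages = [lambda x, k=k: x + k for k in range(3)]   # 3 pipeline stages
    micro_batches = [10, 20, 30, 40]                    # one batch split into 4 micro-batches
    outputs = pipeline_forward(stages, micro_batches)
    print("outputs:", outputs)                          # [13, 23, 33, 43]
```

The printed schedule shows the pipeline filling and draining: only the first and last few ticks leave stages idle (the "bubble"), which is why more micro-batches per batch generally improve utilisation.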
More in Deep Learning
Flash Attention
Architectures · An IO-aware attention algorithm that reduces memory reads and writes by tiling the attention computation, enabling faster training of long-context transformer models.
Sigmoid Function
Training & Optimisation · An activation function that maps input values to a range between 0 and 1, useful for binary classification outputs.
Diffusion Model
Generative Models · A generative model that learns to reverse a gradual noising process, generating high-quality samples from random noise.
Key-Value Cache
Architectures · An optimisation in autoregressive transformer inference that stores previously computed key and value tensors to avoid redundant computation during sequential token generation.
Representation Learning
Architectures · The automatic discovery of data representations needed for feature detection or classification from raw data.
Model Parallelism
Architectures · A distributed training approach that partitions a model across multiple devices, enabling training of models too large to fit in a single accelerator's memory.
Pooling Layer
Architectures · A neural network layer that reduces spatial dimensions by aggregating values, commonly using max or average operations.
Pretraining
Architectures · Training a model on a large general dataset before fine-tuning it on a specific downstream task.