Overview
Direct Answer
Pipeline parallelism is a distributed training technique that partitions a neural network's layers across multiple devices and splits each batch into micro-batches that flow through the resulting stages in sequence, reducing idle time and improving device utilisation. Unlike data parallelism, which replicates the full model on every device, this approach divides the model itself into stages that operate concurrently on different micro-batches.
How It Works
Each device holds a distinct set of consecutive layers, forming a pipeline stage. During forward propagation, micro-batch 1 advances to stage 2 while micro-batch 2 enters stage 1 and micro-batch 3 waits at the pipeline entrance. Backward propagation runs through the stages in reverse order, so a device can compute gradients for one micro-batch while other stages process different micro-batches. Overlapping computation and communication in this way reduces bubble time, the periods when devices sit idle waiting for dependencies.
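The staggered forward pass can be illustrated with a small timetable. The following is a minimal sketch, not a real training schedule: it assumes every stage takes exactly one time step per micro-batch and ignores communication cost and the backward pass. The function name `forward_schedule` and its parameters are hypothetical, chosen for illustration only.

```python
from typing import List, Optional


def forward_schedule(num_stages: int, num_microbatches: int) -> List[List[Optional[int]]]:
    """Build a forward-pass timetable for an idealised pipeline.

    table[t][s] is the 1-based micro-batch that stage s processes at time
    step t, or None when that stage is idle (part of the bubble).
    Assumption: each stage takes one time step per micro-batch.
    """
    total_steps = num_stages + num_microbatches - 1
    table: List[List[Optional[int]]] = []
    for t in range(total_steps):
        row: List[Optional[int]] = []
        for s in range(num_stages):
            mb = t - s  # 0-based micro-batch that reaches stage s at step t
            row.append(mb + 1 if 0 <= mb < num_microbatches else None)
        table.append(row)
    return table


if __name__ == "__main__":
    # Print one row per time step; "-" marks an idle stage.
    for t, row in enumerate(forward_schedule(num_stages=3, num_microbatches=3)):
        cells = ["-" if mb is None else str(mb) for mb in row]
        print(f"step {t}: " + " ".join(cells))
```

In this sketch, the second time step has stage 1 working on micro-batch 2 while stage 2 works on micro-batch 1, and the `None` entries at the start and end of the table are the idle slots that make up the bubble.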
Why It Matters
This approach enables training of models whose memory requirements exceed the capacity of a single device, which can lower training time and hardware costs for organisations developing large language models and vision transformers. It addresses the memory bottleneck that prevents scaling beyond device VRAM limits, making feasible the training of multi-billion-parameter systems that would otherwise not fit on the available hardware.
Common Applications
Pipeline parallelism is widely deployed in large-scale language model training by research institutions and cloud providers. It is particularly relevant for transformer-based architectures with 10 billion or more parameters, especially in natural language processing and multimodal AI development, where models exceed the memory of an individual GPU or TPU.
Key Considerations
The pipeline bubble, the time devices sit idle while the pipeline fills and drains, remains a fundamental efficiency loss; the bubble fraction increases with deeper pipelines and with fewer micro-batches per batch. Practitioners must balance micro-batch size, pipeline depth, and gradient accumulation steps to optimise throughput whilst maintaining convergence behaviour and numerical stability.
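The trade-off between pipeline depth and micro-batch count can be made concrete with a commonly used estimate. The sketch below assumes a simple schedule in which all forward passes run before all backward passes, every stage takes the same time per micro-batch, and communication is free; under those assumptions the idle fraction is (p - 1) / (m + p - 1) for p stages and m micro-batches. The function name `bubble_fraction` is hypothetical, and real schedules may behave differently.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Estimate the idle fraction of an idealised pipeline schedule.

    Assumptions: equal compute time per stage and per micro-batch, no
    communication cost, and all forward passes complete before the
    backward passes begin.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    # Each device is busy for m of the (m + p - 1) time slots.
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    for stages, microbatches in [(4, 4), (4, 16), (8, 16)]:
        frac = bubble_fraction(stages, microbatches)
        print(f"stages={stages} micro-batches={microbatches} bubble={frac:.3f}")
```

Under these assumptions, raising the number of micro-batches at a fixed depth shrinks the bubble, while adding stages at a fixed micro-batch count enlarges it. In practice, more micro-batches per batch also means smaller micro-batches or more gradient accumulation, which is part of the balance described above.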
More in Deep Learning
Self-Attention
Training & Optimisation: An attention mechanism where each element in a sequence attends to all other elements to compute its representation.
Pooling Layer
Architectures: A neural network layer that reduces spatial dimensions by aggregating values, commonly using max or average operations.
LoRA
Language Models: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that adds trainable low-rank matrices to frozen pretrained weights.
Key-Value Cache
Architectures: An optimisation in autoregressive transformer inference that stores previously computed key and value tensors to avoid redundant computation during sequential token generation.
Mixed Precision Training
Training & Optimisation: Training neural networks using both 16-bit and 32-bit floating-point arithmetic to speed up computation while maintaining accuracy.
Vanishing Gradient
Architectures: A problem in deep networks where gradients become extremely small during backpropagation, preventing earlier layers from learning.
Dropout
Training & Optimisation: A regularisation technique that randomly deactivates neurons during training to prevent co-adaptation and reduce overfitting.
Layer Normalisation
Training & Optimisation: A normalisation technique that normalises across the features of each individual sample rather than across the batch.