Overview
Direct Answer
Tensor parallelism is a distributed training strategy that partitions individual weight matrices and activation tensors across multiple devices along specific dimensions, so that the computation of a single model layer runs in parallel across those devices. Unlike data parallelism, which replicates the entire model on every device, tensor parallelism reduces the per-device memory footprint because each device stores and multiplies only its own shard of each weight matrix.
How It Works
During forward and backward propagation, each weight matrix is split column-wise or row-wise across devices. Each device computes a partial result on its assigned tensor slice, and the partial results are then combined through collective operations: a column-wise split produces output slices that are concatenated (all-gather), whereas a row-wise split produces partial sums that are added together (all-reduce). Communication is overlapped with computation where feasible to minimise synchronisation overhead. The granularity and axis of partitioning depend on the layer type and target batch size.
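The splitting and aggregation described here can be sketched without any real devices. The example below is a minimal simulation in plain Python, assuming a two-layer MLP `relu(X @ A) @ B` in which the first weight matrix is split column-wise and the second row-wise, so that a single all-reduce at the end recovers the full output. All names (`tensor_parallel_mlp`, `all_reduce_sum`, and so on) are hypothetical, and the "devices" are simply list entries rather than GPUs.

```python
# Simulated tensor parallelism for a two-layer MLP: Y = relu(X @ A) @ B.
# Assumption: "devices" are list entries; no real communication takes place.

def matmul(x, w):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def relu(m):
    """Element-wise ReLU; safe to apply to a column shard on its own."""
    return [[max(0, v) for v in row] for row in m]

def split_columns(w, parts):
    """Split a matrix into `parts` equal-width column blocks."""
    width = len(w[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]

def split_rows(w, parts):
    """Split a matrix into `parts` equal-height row blocks."""
    height = len(w) // parts
    return [w[i * height:(i + 1) * height] for i in range(parts)]

def all_reduce_sum(partials):
    """Element-wise sum of per-device partial results (stand-in for all-reduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def tensor_parallel_mlp(x, a, b, num_devices):
    a_shards = split_columns(a, num_devices)  # first layer: column-wise split
    b_shards = split_rows(b, num_devices)     # second layer: row-wise split
    partials = []
    for a_k, b_k in zip(a_shards, b_shards):
        hidden_k = relu(matmul(x, a_k))         # this device's slice of the hidden activation
        partials.append(matmul(hidden_k, b_k))  # partial sum of the final output
    return all_reduce_sum(partials)

X = [[1, -2], [3, 4]]
A = [[1, 0, -1, 2], [0, 1, 2, -1]]
B = [[1, 2], [0, 1], [-1, 0], [2, 1]]

reference = matmul(relu(matmul(X, A)), B)             # unsharded computation
sharded = tensor_parallel_mlp(X, A, B, num_devices=2)
print(sharded == reference)
```

Pairing a column-wise split with a row-wise split means the element-wise activation can be applied locally to each hidden slice, and only the final partial sums need to be communicated.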
Why It Matters
This approach enables training of models too large to fit in a single device's memory, which directly affects both the capability and the cost-efficiency of large language model and vision transformer development. Organisations prioritise it when model scale exceeds the practical limits of other parallelism strategies, particularly when batch sizes cannot be increased freely.
Common Applications
Tensor parallelism is widely deployed in training large transformer-based language models and multimodal systems where model dimension is the primary scaling factor. It is frequently combined with pipeline and data parallelism in systems handling billions of parameters.
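When the three strategies are combined, each device holds one coordinate along each parallelism axis. The sketch below shows one possible mapping from a flat device rank to data-, pipeline- and tensor-parallel coordinates. The ordering (tensor-parallel ranks adjacent, so that they can share the fastest interconnect) is an assumption chosen for illustration, and the function names are hypothetical rather than taken from any particular framework.

```python
def rank_to_coords(rank, tp_size, pp_size, dp_size):
    """Map a flat rank to (dp, pp, tp) coordinates.

    Assumption: tensor parallelism is the innermost (fastest-varying) axis,
    followed by pipeline parallelism, then data parallelism.
    """
    world_size = tp_size * pp_size * dp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    tp_rank = rank % tp_size
    pp_rank = (rank // tp_size) % pp_size
    dp_rank = rank // (tp_size * pp_size)
    return {"dp": dp_rank, "pp": pp_rank, "tp": tp_rank}

def tensor_parallel_group(rank, tp_size):
    """Ranks that jointly hold the shards of the same layer as `rank`."""
    start = (rank // tp_size) * tp_size
    return list(range(start, start + tp_size))

# 16 devices arranged as 2 data-parallel replicas x 2 pipeline stages x 4 tensor shards.
coords = rank_to_coords(13, tp_size=4, pp_size=2, dp_size=2)
group = tensor_parallel_group(13, tp_size=4)
print(coords, group)
```

Under this layout, the ranks in one tensor-parallel group are consecutive, which is why such groups are often placed within a single node.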
Key Considerations
Communication bandwidth between devices becomes a critical bottleneck; synchronous all-reduce operations can introduce substantial latency on slower interconnects. The strategy is most effective on high-bandwidth clusters and less suitable for models with small embedding or hidden dimensions relative to device count, because each device's shard then performs too little computation to amortise the communication cost.
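A rough estimate helps to show how strongly the interconnect matters. The sketch below assumes a bandwidth-optimal ring all-reduce, in which each device transfers about 2(p-1)/p times the tensor size, and it ignores per-message latency and any overlap with computation. The tensor shape and the two bandwidth figures are hypothetical placeholders, not measurements.

```python
def ring_all_reduce_seconds(num_elements, bytes_per_element, num_devices,
                            bandwidth_bytes_per_s):
    """Bandwidth-only estimate of one ring all-reduce (latency ignored)."""
    if num_devices == 1:
        return 0.0  # nothing to synchronise on a single device
    payload = num_elements * bytes_per_element
    # Each device sends roughly 2 * (p - 1) / p times the payload.
    traffic = 2 * (num_devices - 1) / num_devices * payload
    return traffic / bandwidth_bytes_per_s

# Hypothetical activation tensor: batch 8 x sequence 2048 x hidden 4096, 2 bytes each.
elements = 8 * 2048 * 4096

# Hypothetical interconnects: a fast intra-node link and a slower network link.
fast = ring_all_reduce_seconds(elements, 2, num_devices=8, bandwidth_bytes_per_s=300e9)
slow = ring_all_reduce_seconds(elements, 2, num_devices=8, bandwidth_bytes_per_s=12.5e9)

print(f"fast link: {fast * 1e3:.2f} ms, slow link: {slow * 1e3:.2f} ms")
```

Because this cost is paid at every sharded layer in both the forward and backward pass, a difference of this size per all-reduce accumulates quickly over a training step.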
More in Deep Learning
Fine-Tuning
Language Models: The process of adapting a pre-trained model to a specific task by continuing training on a smaller task-specific dataset, transferring learned representations to new domains.
Weight Decay
Architectures: A regularisation technique that penalises large model weights during training by adding a term proportional to the weight magnitude (typically the squared L2 norm) to the loss function, which discourages overfitting.
Fully Connected Layer
Architectures: A neural network layer where every neuron is connected to every neuron in the adjacent layers.
Weight Initialisation
Architectures: The strategy for setting initial parameter values in a neural network before training begins.
Attention Head
Training & Optimisation: An individual attention computation within a multi-head attention layer that learns to focus on different aspects of the input, with outputs concatenated for richer representations.
ReLU
Training & Optimisation: Rectified Linear Unit, an activation function that outputs the input directly if positive and zero otherwise.
Self-Attention
Training & Optimisation: An attention mechanism where each element in a sequence attends to all other elements to compute its representation.
Vanishing Gradient
Architectures: A problem in deep networks where gradients become extremely small during backpropagation, preventing earlier layers from learning.