Tensor Parallelism

Overview

Direct Answer

Tensor parallelism is a distributed training strategy that partitions individual weight matrices, and the activations they produce, across multiple devices along a chosen dimension, so that a single model layer is computed in parallel. Unlike data parallelism, which replicates the entire model on every device, it shards the matrix multiplications themselves: each device stores and updates only its slice of the weights, which reduces the per-device memory footprint.
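The memory saving can be illustrated with some quick arithmetic. The sketch below is not taken from any particular framework: the helper name `weight_bytes_per_device`, the 4096 x 16384 matrix and the 16-bit precision are all assumptions chosen for illustration, and it counts weights only (no gradients, optimiser state or activations).

```python
def weight_bytes_per_device(num_params, bytes_per_param, tp_degree):
    """Bytes of weights held by each device when a matrix is sharded evenly.

    tp_degree is the number of devices sharing the matrix (hypothetical name);
    tp_degree=1 corresponds to holding a full replica, as in data parallelism.
    """
    if num_params % tp_degree != 0:
        raise ValueError("parameter count must divide evenly across devices")
    return num_params // tp_degree * bytes_per_param


# Hypothetical 4096 x 16384 projection matrix stored at 2 bytes per parameter.
params = 4096 * 16384
replicated = weight_bytes_per_device(params, 2, tp_degree=1)  # full copy per device
sharded = weight_bytes_per_device(params, 2, tp_degree=8)     # one eighth per device

print(replicated // 2**20, "MiB replicated,", sharded // 2**20, "MiB per device when sharded")
```

Under these assumptions the per-device weight memory falls in proportion to the number of devices sharing the matrix.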

How It Works

During forward and backward propagation, weight matrices are split column-wise or row-wise across devices. Each device computes a partial result on its assigned tensor slice, and the partial results are then combined through a collective operation: typically an all-reduce (sum) after a row-wise split, or an all-gather after a column-wise split. Where feasible, communication is overlapped with computation to reduce synchronisation overhead. The granularity and axis of partitioning depend on the layer type and target batch size.
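A minimal sketch of the split-and-aggregate pattern follows, simulating two devices in a single process with plain Python lists so that it runs without dependencies. The helper names (`split_columns`, `split_rows`, `all_reduce_sum`) and the tiny matrices are illustrative assumptions, not any library's API. The first weight matrix is split by columns and the second by rows, so each simulated device can run both multiplications on its own slices and only one summation is needed at the end.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w, parts):
    """Split a matrix into `parts` equal column blocks (one per device)."""
    width = len(w[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]


def split_rows(w, parts):
    """Split a matrix into `parts` equal row blocks (one per device)."""
    height = len(w) // parts
    return [w[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Stand-in for an all-reduce: element-wise sum of every device's partial output."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1, 2], [3, 4]]                  # activations: batch 2, hidden 2
a = [[1, 0, 2, 1], [0, 1, 1, 2]]      # first weight (2 x 4), split by columns
b = [[1, 0], [0, 1], [1, 1], [2, 0]]  # second weight (4 x 2), split by rows
devices = 2

a_shards = split_columns(a, devices)
b_shards = split_rows(b, devices)

# Each device's column slice of `a` yields a slice of the intermediate
# activations that lines up with its row slice of `b`, so nothing is
# exchanged between the two multiplications.
partials = [matmul(matmul(x, a_i), b_i) for a_i, b_i in zip(a_shards, b_shards)]
z_parallel = all_reduce_sum(partials)

# Reference result computed without any sharding.
z_reference = matmul(matmul(x, a), b)
assert z_parallel == z_reference
```

Pairing a column-wise split with a row-wise split in this way is one common arrangement for a transformer's feed-forward block; an element-wise activation between the two multiplications would not change the picture, because each device already holds complete columns of the intermediate result.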

Why It Matters

This approach enables training of models whose parameters would not fit within a single device's memory, which directly affects capability and cost-efficiency in large language model and vision transformer development. Organisations prioritise it when model scale exceeds the practical limits of other parallelism strategies, particularly when batch sizes cannot be increased freely, since data parallelism scales by processing more samples per step.

Common Applications

Tensor parallelism is widely deployed in training large transformer-based language models and multimodal systems where model dimension is the primary scaling factor. It is frequently combined with pipeline and data parallelism when training models with billions of parameters.
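When the three strategies are combined, the total device count is the product of the three degrees, and every device has one index in each dimension. The sketch below shows one possible mapping from a global rank to those indices. The function name `rank_to_coords`, the degrees 4, 2 and 2, and the ordering (tensor-parallel ranks adjacent) are assumptions for illustration; real frameworks may order the dimensions differently.

```python
def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (dp, pp, tp) indices, with tensor-parallel ranks adjacent."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank outside the device grid")
    tp_idx = rank % tp                 # position within the tensor-parallel group
    pp_idx = (rank // tp) % pp         # pipeline stage
    dp_idx = rank // (tp * pp)         # data-parallel replica
    return dp_idx, pp_idx, tp_idx


# Hypothetical layout: 4-way tensor, 2-way pipeline, 2-way data parallelism.
tp, pp, dp = 4, 2, 2
world_size = tp * pp * dp
grid = [rank_to_coords(r, tp, pp, dp) for r in range(world_size)]

# Ranks that share rank 5's replica and pipeline stage form its tensor-parallel group.
tp_group_of_rank_5 = [r for r in range(world_size) if grid[r][:2] == grid[5][:2]]
print(world_size, tp_group_of_rank_5)
```

Keeping tensor-parallel ranks adjacent is a design choice made here on the assumption that neighbouring ranks share the fastest interconnect, which suits the frequent collectives that tensor parallelism requires.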

Key Considerations

Communication bandwidth between devices becomes a critical bottleneck; synchronous all-reduce operations can introduce substantial latency on slower interconnects. The strategy is most effective on high-bandwidth clusters and less suitable for models with small embedding or hidden dimensions relative to device count, because each device's slice then becomes too small to offset the communication cost.
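A rough estimate makes the bandwidth sensitivity concrete. The sketch below uses the standard bandwidth-only cost of a ring all-reduce, in which each device sends about 2(N-1)/N times the message size, and ignores latency and any overlap with computation. The function names, the tensor shape and both bandwidth figures are hypothetical values chosen for illustration.

```python
def ring_all_reduce_seconds(message_bytes, devices, bandwidth_bytes_per_s):
    """Bandwidth-only estimate: each device sends 2*(N-1)/N of the message in a ring."""
    if devices < 2:
        return 0.0  # nothing to exchange on a single device
    traffic = 2 * (devices - 1) / devices * message_bytes
    return traffic / bandwidth_bytes_per_s


def shard_width(hidden_size, devices):
    """Per-device slice of the hidden dimension; uneven splits are rejected."""
    if hidden_size % devices != 0:
        raise ValueError("hidden size must be divisible by the device count")
    return hidden_size // devices


# Hypothetical activation tensor: 8 sequences x 2048 tokens x 4096 hidden, 2 bytes each.
message = 8 * 2048 * 4096 * 2

# Hypothetical per-device bandwidths: a fast intra-node link and a slower network link.
fast = ring_all_reduce_seconds(message, devices=8, bandwidth_bytes_per_s=300e9)
slow = ring_all_reduce_seconds(message, devices=8, bandwidth_bytes_per_s=12.5e9)

print(f"fast link: {fast * 1e3:.3f} ms, slow link: {slow * 1e3:.3f} ms per all-reduce")
print("hidden units per device:", shard_width(4096, 8))
```

Because a collective of this kind typically runs in every sharded layer on every step, the gap between the two estimates compounds quickly, which is why the tensor-parallel group is usually kept on the fastest available links.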

Cross-References

Business & Strategy

Strategy

Related in Architectures

Deep Learning

A subset of machine learning using neural networks with multiple layers to learn hierarchical representations of data.

Neural Network

A computing system inspired by biological neural networks, consisting of interconnected nodes that process information in layers.

Convolutional Neural Network

A deep learning architecture designed for processing structured grid data like images, using convolutional filters to detect features.

Recurrent Neural Network

A neural network architecture where connections between nodes form directed cycles, enabling processing of sequential data.

Long Short-Term Memory

A recurrent neural network architecture designed to learn long-term dependencies by using gating mechanisms to control information flow.

Gated Recurrent Unit

A simplified variant of LSTM that combines the forget and input gates into a single update gate.

Transformer

A neural network architecture based entirely on attention mechanisms, eliminating recurrence and enabling parallel processing of sequences.

Attention Mechanism

A neural network component that learns to focus on relevant parts of the input when producing each element of the output.

Encoder-Decoder Architecture

A neural network design where an encoder processes input into a fixed representation and a decoder generates output from it.

Autoencoder

A neural network trained to encode input data into a compressed representation and then decode it back to reconstruct the original.

Variational Autoencoder

A generative model that learns a probabilistic latent space representation, enabling generation of new data samples.

Batch Normalisation

A technique that normalises layer inputs during training to stabilise and accelerate deep neural network learning.

More in Deep Learning

Fine-Tuning

Language Models

The process of adapting a pre-trained model to a specific task by continuing training on a smaller task-specific dataset, transferring learned representations to new domains.

Weight Decay

Architectures

A regularisation technique that penalises large model weights during training, typically by adding a term proportional to the squared weight magnitude to the loss function, which helps prevent overfitting.

Fully Connected Layer

Architectures

A neural network layer where every neuron is connected to every neuron in the adjacent layers.

Weight Initialisation

Architectures

The strategy for setting initial parameter values in a neural network before training begins.

Attention Head

Training & Optimisation

An individual attention computation within a multi-head attention layer that learns to focus on different aspects of the input, with outputs concatenated for richer representations.

ReLU

Training & Optimisation

Rectified Linear Unit — an activation function that outputs the input directly if positive, otherwise outputs zero.

Self-Attention

Training & Optimisation

An attention mechanism where each element in a sequence attends to all other elements to compute its representation.

Vanishing Gradient

Architectures

A problem in deep networks where gradients become extremely small during backpropagation, preventing earlier layers from learning.

See Also

Strategy