Overview
An attention mechanism that runs multiple attention operations in parallel, capturing different types of relationships.
Cross-References(1)
More in Deep Learning
Weight Decay
ArchitecturesA regularisation technique that penalises large model weights during training by adding a fraction of the weight magnitude to the loss function, preventing overfitting.
Gated Recurrent Unit
ArchitecturesA simplified variant of LSTM that combines the forget and input gates into a single update gate.
Capsule Network
ArchitecturesA neural network architecture that groups neurons into capsules to better capture spatial hierarchies and part-whole relationships.
Neural Network
ArchitecturesA computing system inspired by biological neural networks, consisting of interconnected nodes that process information in layers.
Gradient Checkpointing
ArchitecturesA memory optimisation that trades computation for memory by recomputing intermediate activations during the backward pass instead of storing them all during the forward pass.
Convolutional Layer
ArchitecturesA neural network layer that applies learnable filters across input data to detect local patterns and features.
Pipeline Parallelism
ArchitecturesA form of model parallelism that splits neural network layers across devices and pipelines micro-batches through stages, maximising hardware utilisation during training.
Recurrent Neural Network
ArchitecturesA neural network architecture where connections between nodes form directed cycles, enabling processing of sequential data.