Overview
An individual attention computation within a multi-head attention layer that learns to focus on different aspects of the input, with outputs concatenated for richer representations.
Cross-References(1)
More in Deep Learning
Convolutional Neural Network
ArchitecturesA deep learning architecture designed for processing structured grid data like images, using convolutional filters to detect features.
Pipeline Parallelism
ArchitecturesA form of model parallelism that splits neural network layers across devices and pipelines micro-batches through stages, maximising hardware utilisation during training.
Pooling Layer
ArchitecturesA neural network layer that reduces spatial dimensions by aggregating values, commonly using max or average operations.
Vanishing Gradient
ArchitecturesA problem in deep networks where gradients become extremely small during backpropagation, preventing earlier layers from learning.
Attention Mechanism
ArchitecturesA neural network component that learns to focus on relevant parts of the input when producing each element of the output.
Batch Normalisation
ArchitecturesA technique that normalises layer inputs during training to stabilise and accelerate deep neural network learning.
Flash Attention
ArchitecturesAn IO-aware attention algorithm that reduces memory reads and writes by tiling the attention computation, enabling faster training of long-context transformer models.
Contrastive Learning
ArchitecturesA self-supervised learning approach that trains models by comparing similar and dissimilar pairs of data representations.