Overview
An attention mechanism where each element in a sequence attends to all other elements to compute its representation.
Cross-References(1)
More in Deep Learning
Weight Initialisation
ArchitecturesThe strategy for setting initial parameter values in a neural network before training begins.
Vanishing Gradient
ArchitecturesA problem in deep networks where gradients become extremely small during backpropagation, preventing earlier layers from learning.
Mamba Architecture
ArchitecturesA selective state space model that achieves transformer-level performance with linear-time complexity by incorporating input-dependent selection mechanisms into the recurrence.
Graph Neural Network
ArchitecturesA neural network designed to operate on graph-structured data, learning representations of nodes, edges, and entire graphs.
Mixture of Experts
ArchitecturesAn architecture where different specialised sub-networks (experts) are selectively activated based on the input.
Neural Network
ArchitecturesA computing system inspired by biological neural networks, consisting of interconnected nodes that process information in layers.
Fine-Tuning
Language ModelsThe process of adapting a pre-trained model to a specific task by continuing training on a smaller task-specific dataset, transferring learned representations to new domains.
Pipeline Parallelism
ArchitecturesA form of model parallelism that splits neural network layers across devices and pipelines micro-batches through stages, maximising hardware utilisation during training.