Overview
Direct Answer
An attention mechanism is a neural network component that dynamically weights input elements to selectively focus on the most relevant information when computing each output representation. It enables models to learn which parts of the input to prioritise, rather than treating all inputs equally.
How It Works
The mechanism computes attention weights through a scaled dot-product calculation between query and key vectors, then applies these weights to value vectors via softmax normalisation. This allows the network to assign higher importance to semantically relevant positions whilst suppressing irrelevant ones, creating context-dependent output representations.
Why It Matters
Attention significantly improves model accuracy on sequence-to-sequence tasks, reduces training time through parallelisation, and enables interpretability by revealing which input regions influenced specific predictions. These improvements directly enhance performance in translation, summarisation, and question-answering systems whilst reducing computational waste.
Common Applications
Machine translation (encoder-decoder architectures), natural language understanding in transformer-based models, image captioning, speech recognition, and clinical text analysis. Multi-head variants are standard in contemporary large language models and vision transformers.
Key Considerations
Computational complexity scales quadratically with sequence length, limiting applicability to very long documents without approximation techniques. Practitioners must balance interpretability gains against increased model complexity and memory requirements during inference.
Cross-References(1)
Referenced By3 terms mention Attention Mechanism
Other entries in the wiki whose definition references Attention Mechanism — useful for understanding how this concept connects across Deep Learning and adjacent domains.
More in Deep Learning
Rotary Positional Encoding
Training & OptimisationA position encoding method that encodes absolute position with a rotation matrix and naturally incorporates relative position information into attention computations.
Adapter Layers
Language ModelsSmall trainable modules inserted between frozen transformer layers that enable task-specific adaptation without modifying the original model weights.
Self-Attention
Training & OptimisationAn attention mechanism where each element in a sequence attends to all other elements to compute its representation.
Multi-Head Attention
Training & OptimisationAn attention mechanism that runs multiple attention operations in parallel, capturing different types of relationships.
LoRA
Language ModelsLow-Rank Adaptation — a parameter-efficient fine-tuning technique that adds trainable low-rank matrices to frozen pretrained weights.
Gradient Checkpointing
ArchitecturesA memory optimisation that trades computation for memory by recomputing intermediate activations during the backward pass instead of storing them all during the forward pass.
Pipeline Parallelism
ArchitecturesA form of model parallelism that splits neural network layers across devices and pipelines micro-batches through stages, maximising hardware utilisation during training.
Key-Value Cache
ArchitecturesAn optimisation in autoregressive transformer inference that stores previously computed key and value tensors to avoid redundant computation during sequential token generation.