Overview
Direct Answer
An attention head is an individual computational unit within a multi-head attention mechanism that applies learned queries, keys, and values to compute weighted relevance scores across input sequences. Each head independently learns to attend to different positional and semantic relationships, with outputs concatenated to form richer contextual representations.
How It Works
Each attention head performs scaled dot-product attention by computing compatibility scores between query vectors and key vectors, normalising these scores via softmax, and using them to weight value vectors. Multiple heads operate in parallel with separate learned parameter matrices, allowing the model to simultaneously capture syntactic patterns, long-range dependencies, and semantic features from different representation subspaces.
Why It Matters
Multiple independent heads improve model capacity and interpretability whilst maintaining computational efficiency through parallelisation. This architecture has become foundational for state-of-the-art performance in language understanding, translation, and sequence modelling tasks, directly impacting accuracy and convergence speed in production NLP systems.
Common Applications
Attention heads are integral to transformer models used in machine translation systems, large language models for text generation, and multimodal systems combining vision and language. They enable models like BERT and GPT to achieve superior performance on classification, question-answering, and summarisation tasks across enterprise applications.
Key Considerations
Practitioners must balance the number of heads against computational cost and memory requirements; too few heads may limit representational capacity whilst excessive heads introduce redundancy without proportional performance gains. Attention patterns across heads often show correlation, suggesting some redundancy is inherent to the design.
Cross-References(1)
More in Deep Learning
Model Parallelism
ArchitecturesA distributed training approach that partitions a model across multiple devices, enabling training of models too large to fit in a single accelerator's memory.
Data Parallelism
ArchitecturesA distributed training strategy that replicates the model across multiple devices and divides training data into batches processed simultaneously, synchronising gradients after each step.
Weight Decay
ArchitecturesA regularisation technique that penalises large model weights during training by adding a fraction of the weight magnitude to the loss function, preventing overfitting.
Deep Learning
ArchitecturesA subset of machine learning using neural networks with multiple layers to learn hierarchical representations of data.
Skip Connection
ArchitecturesA neural network shortcut that allows the output of one layer to bypass intermediate layers and be added to a later layer's output.
Fully Connected Layer
ArchitecturesA neural network layer where every neuron is connected to every neuron in the adjacent layers.
State Space Model
ArchitecturesA sequence modelling architecture based on continuous-time dynamical systems that processes long sequences with linear complexity, offering an alternative to attention-based transformers.
Adapter Layers
Language ModelsSmall trainable modules inserted between frozen transformer layers that enable task-specific adaptation without modifying the original model weights.