Attention Head — Technology Wiki

Overview

Direct Answer

An attention head is a single attention unit within a multi-head attention mechanism. It applies its own learned query, key, and value projections to the input and computes weighted relevance scores across the sequence. Each head can learn to attend to different positional and semantic relationships, and the head outputs are concatenated to form a richer contextual representation.

How It Works

Each attention head performs scaled dot-product attention: it computes compatibility scores as dot products between query vectors and key vectors, divides them by the square root of the key dimension, normalises the scaled scores via softmax, and uses the resulting weights to form a weighted sum of the value vectors. Multiple heads operate in parallel with separate learned parameter matrices, allowing the model to simultaneously capture syntactic patterns, long-range dependencies, and semantic features from different representation subspaces.
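The computation described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper names (softmax, matmul, attention_head, multi_head, random_matrix) and the toy sizes are assumptions made for the example, and the weights are random instead of learned. The output projection that usually follows concatenation is left out, as are masking and batching.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def attention_head(x, w_q, w_k, w_v):
    """One head: project the input, score queries against keys, weight the values.

    x is (seq_len x d_model); w_q, w_k, w_v are (d_model x d_head).
    Returns (output, weights): (seq_len x d_head) and (seq_len x seq_len).
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(w_k[0]))  # square root of the key dimension
    # Scaled dot product of every query row with every key row.
    scores = [[sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
              for q_row in q]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, v), weights


def multi_head(x, heads):
    """Run each head independently, then concatenate outputs per position."""
    outputs = [attention_head(x, *params)[0] for params in heads]
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]


def random_matrix(rows, cols):
    """Hypothetical stand-in for learned parameters or input embeddings."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads  # assumed convention: heads split the model width

x = random_matrix(seq_len, d_model)
heads = [tuple(random_matrix(d_model, d_head) for _ in range(3))
         for _ in range(n_heads)]

out = multi_head(x, heads)
print(len(out), len(out[0]))  # prints: 4 8
```

Each head receives the same input but its own three projection matrices, so the heads can produce different attention weights over the same sequence. Concatenating two heads of width 4 restores the original model width of 8.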

Why It Matters

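Splitting attention across several heads lets a model attend to different representation subspaces at once, and per-head attention patterns offer a partial view of what the model is attending to. Because each head typically works in a reduced dimension (the model dimension divided by the number of heads) and the heads run in parallel, the total cost stays close to that of a single full-width head. This architecture has become foundational for state-of-the-art performance in language understanding, translation, and sequence modelling tasks.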

Common Applications

Attention heads are integral to transformer models used in machine translation systems, large language models for text generation, and multimodal systems combining vision and language. Models such as BERT and GPT rely on them for classification, question-answering, and summarisation tasks.

Key Considerations

Practitioners must balance the number of heads against computational cost and memory requirements. Too few heads may limit representational capacity, whilst too many introduce redundancy without proportional performance gains; at a fixed model dimension, each additional head also narrows the subspace available to every head. Attention patterns across heads are often correlated, suggesting some redundancy is inherent to the design.
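The trade-off can be made concrete with some simple arithmetic. The sketch below assumes the common convention that the per-head width is the model dimension divided by the number of heads, and it ignores biases and the output projection. The function names head_budget and head_similarity are hypothetical, and cosine similarity between attention maps is only one rough proxy for redundancy between heads.

```python
import math


def head_budget(d_model, n_heads, seq_len):
    """Return per-head width, Q/K/V parameter count, and attention-map entries.

    Assumes d_head = d_model // n_heads; biases and the output projection
    are ignored.
    """
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads
    qkv_params = n_heads * 3 * d_model * d_head  # three projections per head
    attn_entries = n_heads * seq_len * seq_len   # one seq_len x seq_len map per head
    return d_head, qkv_params, attn_entries


def head_similarity(weights_a, weights_b):
    """Cosine similarity between two heads' flattened attention maps."""
    a = [w for row in weights_a for w in row]
    b = [w for row in weights_b for w in row]
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm


# Columns: number of heads, per-head width, Q/K/V parameters, attention-map entries.
for n_heads in (1, 4, 8, 16):
    print(n_heads, *head_budget(512, n_heads, seq_len=128))

# Two toy attention maps over a two-token sequence.
uniform = [[0.5, 0.5], [0.5, 0.5]]    # attends equally to both tokens
diagonal = [[1.0, 0.0], [0.0, 1.0]]   # each token attends only to itself
print(head_similarity(uniform, diagonal))
```

Under these assumptions the projection parameter count stays constant as heads are added, while each head becomes narrower and the number of stored attention weights grows linearly with the head count. A similarity close to 1 between two heads' attention maps hints that they attend in much the same way, although it says nothing about whether their value projections differ.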

Cross-References

Deep Learning

Multi-Head Attention

Related in Training & Optimisation

Self-Attention

An attention mechanism where each element in a sequence attends to every element of that sequence, itself included, to compute its representation.

Multi-Head Attention

An attention mechanism that runs multiple attention operations in parallel, capturing different types of relationships.

Residual Network

A deep neural network architecture using skip connections that allow gradients to flow directly through layers, enabling very deep networks.

Layer Normalisation

A normalisation technique that normalises across the features of each individual sample rather than across the batch.

Dropout

A regularisation technique that randomly deactivates neurons during training to prevent co-adaptation and reduce overfitting.

Activation Function

A mathematical function applied to neural network outputs to introduce non-linearity, enabling the learning of complex patterns.

ReLU

Rectified Linear Unit — an activation function that outputs the input directly if positive, otherwise outputs zero.

Sigmoid Function

An activation function that maps input values to a range between 0 and 1, useful for binary classification outputs.

Softmax Function

An activation function that converts a vector of numbers into a probability distribution, commonly used in multi-class classification.

Positional Encoding

A technique that injects information about the position of tokens in a sequence into transformer architectures.

Gradient Clipping

A technique that caps gradient values or gradient norms during training to prevent the exploding gradient problem.

Mixed Precision Training

Training neural networks using both 16-bit and 32-bit floating-point arithmetic to speed up computation while maintaining accuracy.

More in Deep Learning

Model Parallelism

Architectures

A distributed training approach that partitions a model across multiple devices, enabling training of models too large to fit in a single accelerator's memory.

Data Parallelism

Architectures

A distributed training strategy that replicates the model across multiple devices and splits each training batch across them so the shards are processed simultaneously, synchronising gradients after each step.

Weight Decay

Architectures

A regularisation technique that penalises large model weights during training, typically by adding a term proportional to the squared weight norm to the loss function, which helps reduce overfitting.

Deep Learning

Architectures

A subset of machine learning using neural networks with multiple layers to learn hierarchical representations of data.

Skip Connection

Architectures

A neural network shortcut that allows the output of one layer to bypass intermediate layers and be added to a later layer's output.

Fully Connected Layer

Architectures

A neural network layer where every neuron receives input from every neuron in the preceding layer.

State Space Model

Architectures

A sequence modelling architecture based on continuous-time dynamical systems that processes long sequences with complexity linear in sequence length, offering an alternative to attention-based transformers.

Adapter Layers

Language Models

Small trainable modules inserted between frozen transformer layers that enable task-specific adaptation without modifying the original model weights.