Self-Attention — Technology Wiki

Overview

An attention mechanism where each element in a sequence attends to all other elements to compute its representation.

Cross-References(1)

Deep Learning

Attention Mechanism

Related in Training & Optimisation

Multi-Head Attention

An attention mechanism that runs multiple attention operations in parallel, capturing different types of relationships.

Residual Network

A deep neural network architecture using skip connections that allow gradients to flow directly through layers, enabling very deep networks.

Layer Normalisation

A normalisation technique that normalises across the features of each individual sample rather than across the batch.

Dropout

A regularisation technique that randomly deactivates neurons during training to prevent co-adaptation and reduce overfitting.

Activation Function

A mathematical function applied to neural network outputs to introduce non-linearity, enabling the learning of complex patterns.

ReLU

Rectified Linear Unit — an activation function that outputs the input directly if positive, otherwise outputs zero.

Sigmoid Function

An activation function that maps input values to a range between 0 and 1, useful for binary classification outputs.

Softmax Function

An activation function that converts a vector of numbers into a probability distribution, commonly used in multi-class classification.

Positional Encoding

A technique that injects information about the position of tokens in a sequence into transformer architectures.

Gradient Clipping

A technique that caps gradient values during training to prevent the exploding gradient problem.

Mixed Precision Training

Training neural networks using both 16-bit and 32-bit floating-point arithmetic to speed up computation while maintaining accuracy.

Rotary Positional Encoding

A position encoding method that encodes absolute position with a rotation matrix and naturally incorporates relative position information into attention computations.

More in Deep Learning

Weight Initialisation

Architectures

The strategy for setting initial parameter values in a neural network before training begins.

Vanishing Gradient

Architectures

A problem in deep networks where gradients become extremely small during backpropagation, preventing earlier layers from learning.

Mamba Architecture

Architectures

A selective state space model that achieves transformer-level performance with linear-time complexity by incorporating input-dependent selection mechanisms into the recurrence.

Graph Neural Network

Architectures

A neural network designed to operate on graph-structured data, learning representations of nodes, edges, and entire graphs.

Mixture of Experts

Architectures

An architecture where different specialised sub-networks (experts) are selectively activated based on the input.

Neural Network

Architectures

A computing system inspired by biological neural networks, consisting of interconnected nodes that process information in layers.

Fine-Tuning

Language Models

The process of adapting a pre-trained model to a specific task by continuing training on a smaller task-specific dataset, transferring learned representations to new domains.

Pipeline Parallelism

Architectures

A form of model parallelism that splits neural network layers across devices and pipelines micro-batches through stages, maximising hardware utilisation during training.