Positional Encoding — Technology Wiki

Overview

A technique that injects information about each token's position in a sequence into transformer architectures, whose self-attention layers are otherwise insensitive to token order.
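The page does not say which scheme is meant, so the following is only a minimal sketch of one common variant: the fixed sinusoidal encoding, in which each pair of dimensions carries a sine and a cosine of the position at a different frequency. The function names, the `base` value of 10000 and the choice to add the encoding to the token embeddings are assumptions made for illustration.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model, base=10000.0):
    """Return a seq_len x d_model table (list of lists) of sinusoidal encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for this sketch")
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            # Each pair of dimensions (i, i + 1) shares one frequency.
            angle = pos / (base ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table

def add_positions(embeddings, table):
    """Element-wise sum of token embeddings and their positional encodings."""
    return [[e + p for e, p in zip(emb_row, pos_row)]
            for emb_row, pos_row in zip(embeddings, table)]
```

Calling `add_positions(token_embeddings, sinusoidal_positional_encoding(seq_len, d_model))` gives the position-aware input that would be fed to the first transformer layer. Learned position embeddings are a common alternative to this fixed table.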

Cross-References

Deep Learning

Transformer

Related in Training & Optimisation

Self-Attention

An attention mechanism where each element in a sequence attends to all other elements to compute its representation.
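As a rough illustration, the sketch below implements scaled dot-product self-attention with a deliberate simplification: queries, keys and values are all the raw input vectors (no learned projection matrices), which is an assumption made to keep the example short. It also shows why position information has to be supplied separately: reordering the inputs only reorders the outputs.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """x: list of n vectors of dimension d. Simplification: Q = K = V = x."""
    d = len(x[0])
    out = []
    for q in x:
        # Similarity of this element to every element, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # New representation: attention-weighted average of all elements.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)
# Reversing the input order yields the same outputs in reversed order.
attended_reversed = self_attention(tokens[::-1])
```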

Multi-Head Attention

An attention mechanism that runs multiple attention operations in parallel, capturing different types of relationships.

Residual Network

A deep neural network architecture using skip connections that allow gradients to flow directly through layers, enabling very deep networks.

Layer Normalisation

A normalisation technique that normalises across the features of each individual sample rather than across the batch.
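A minimal sketch of the idea, assuming the usual formulation without the learnable gain and bias that practical implementations normally add. The function name and the `eps` value are illustrative choices.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise one sample's feature vector to roughly zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Each sample is normalised using only its own features,
# so the result does not depend on the other samples in the batch.
batch = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
normalised = [layer_norm(sample) for sample in batch]
```

Because the statistics are computed per sample, the same code works for a batch of one, which is one reason this form of normalisation suits sequence models.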

Dropout

A regularisation technique that randomly deactivates neurons during training to prevent co-adaptation and reduce overfitting.

Activation Function

A mathematical function applied to neural network outputs to introduce non-linearity, enabling the learning of complex patterns.

ReLU

Rectified Linear Unit — an activation function that outputs the input directly if positive, otherwise outputs zero.

Sigmoid Function

An activation function that maps input values to a range between 0 and 1, useful for binary classification outputs.

Softmax Function

An activation function that converts a vector of numbers into a probability distribution, commonly used in multi-class classification.

Gradient Clipping

A technique that caps gradient values during training to prevent the exploding gradient problem.
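Two common ways of capping gradients can be sketched as follows, treating the gradient as a flat list of floats. Clipping by value matches the description above most directly; clipping by norm is a widely used alternative that preserves the gradient's direction. The function names are hypothetical.

```python
import math

def clip_by_value(grads, limit):
    """Clamp every gradient component into [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```

Clipping by value can change the direction of the update, since large components are cut more than small ones, whereas clipping by norm only shortens it.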

Mixed Precision Training

Training neural networks using both 16-bit and 32-bit floating-point arithmetic to speed up computation while maintaining accuracy.

Rotary Positional Encoding

A position encoding method that encodes absolute position with a rotation matrix and naturally incorporates relative position information into attention computations.
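The sketch below illustrates the rotation idea under stated assumptions: adjacent dimensions `(i, i + 1)` are treated as a 2D pair and rotated by an angle proportional to the absolute position, with a frequency base of 10000. Some implementations pair dimensions differently. The `rotate` and `dot` helpers are hypothetical names.

```python
import math

def rotate(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle that grows with pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2D rotation
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Both pairs of positions are 2 apart, so the two attention scores agree
# (up to floating-point error) even though the absolute positions differ.
score_near = dot(rotate(q, 3), rotate(k, 1))
score_far = dot(rotate(q, 12), rotate(k, 10))
```

Each vector is rotated using only its own absolute position, yet the query-key dot product depends only on the offset between the two positions, which is the sense in which relative position enters the attention computation.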

More in Deep Learning

Mixture of Experts

Architectures

An architecture where different specialised sub-networks (experts) are selectively activated based on the input.

Deep Learning

Architectures

A subset of machine learning using neural networks with multiple layers to learn hierarchical representations of data.

Gradient Checkpointing

Architectures

A memory optimisation that trades computation for memory by recomputing intermediate activations during the backward pass instead of storing them all during the forward pass.

Data Parallelism

Architectures

A distributed training strategy that replicates the model across multiple devices, splits each training batch into shards that the devices process simultaneously, and synchronises gradients after each step.

Pretraining

Architectures

Training a model on a large general dataset before fine-tuning it on a specific downstream task.

Knowledge Distillation

Architectures

A model compression technique where a smaller student model learns to mimic the behaviour of a larger teacher model.

Generative Adversarial Network

Generative Models

A framework where two neural networks compete — a generator creates synthetic data while a discriminator evaluates its authenticity.

Recurrent Neural Network

Architectures

A neural network architecture where connections between nodes form directed cycles, enabling processing of sequential data.