Overview
A regularisation technique that randomly deactivates neurons during training to prevent co-adaptation and reduce overfitting.
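In practice this is usually implemented as "inverted" dropout: during training each activation is kept with probability 1 − p and the survivors are scaled by 1/(1 − p), so at inference the layer can simply pass activations through unchanged. A minimal NumPy sketch of that idea follows; the function and parameter names are illustrative and not tied to any particular framework.

```python
import numpy as np

def dropout(x, p_drop=0.5, training=True, rng=None):
    """Inverted dropout: zero each activation with probability p_drop during
    training and rescale the survivors by 1/(1 - p_drop), so the expected
    activation matches the inference-time (no-op) behaviour."""
    if not training or p_drop == 0.0:
        return x
    rng = rng if rng is not None else np.random.default_rng()
    keep_prob = 1.0 - p_drop
    mask = rng.random(x.shape) < keep_prob  # Boolean keep/drop mask
    return np.where(mask, x / keep_prob, 0.0)

# Roughly half the hidden units are zeroed during training;
# at inference the activations pass through unchanged.
h = np.random.default_rng(0).standard_normal((4, 8))
h_train = dropout(h, p_drop=0.5, training=True)
h_eval = dropout(h, p_drop=0.5, training=False)
```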
More in Deep Learning
State Space Model
A sequence modelling architecture based on continuous-time dynamical systems that processes long sequences with linear complexity, offering an alternative to attention-based transformers.
Representation Learning
The automatic discovery of data representations needed for feature detection or classification from raw data.
Recurrent Neural Network
A neural network architecture where connections between nodes form directed cycles, enabling processing of sequential data.
Encoder-Decoder Architecture
A neural network design where an encoder processes input into a fixed representation and a decoder generates output from it.
Vanishing Gradient
A problem in deep networks where gradients become extremely small during backpropagation, preventing earlier layers from learning.
Tensor Parallelism
A distributed computing strategy that splits individual layer computations across multiple devices by partitioning weight matrices along specific dimensions.
Mamba Architecture
A selective state space model that achieves transformer-level performance with linear-time complexity by incorporating input-dependent selection mechanisms into the recurrence.
Pretraining
Training a model on a large general dataset before fine-tuning it on a specific downstream task.