Overview
A subset of machine learning using neural networks with multiple layers to learn hierarchical representations of data.
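As a minimal sketch of the idea (PyTorch is assumed here purely for illustration; the glossary names no library), a small stack of layers where each layer transforms the output of the one below it:

    import torch
    import torch.nn as nn

    # Each layer builds on the representation produced by the previous one,
    # giving the hierarchy of learned features described above.
    model = nn.Sequential(
        nn.Linear(784, 256), nn.ReLU(),   # lower-level features
        nn.Linear(256, 64), nn.ReLU(),    # higher-level features
        nn.Linear(64, 10),                # task-specific outputs
    )
    x = torch.randn(32, 784)              # a batch of 32 flattened inputs
    logits = model(x)                     # shape (32, 10)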
More in Deep Learning
Attention Head
An individual attention computation within a multi-head attention layer that learns to focus on different aspects of the input, with outputs concatenated for richer representations.
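A minimal sketch of a single head as scaled dot-product attention (PyTorch assumed; the function name attention_head is illustrative):

    import torch
    import torch.nn.functional as F

    def attention_head(q, k, v):
        # Each query position scores every key position; the softmaxed
        # scores then mix the values. This is what one head computes.
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        return F.softmax(scores, dim=-1) @ v

    q = k = v = torch.randn(2, 10, 64)    # (batch, seq_len, head_dim)
    out = attention_head(q, k, v)         # (2, 10, 64)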
Residual Network
A deep neural network architecture using skip connections that allow gradients to flow directly through layers, enabling very deep networks to be trained.
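A minimal sketch of one residual block (PyTorch assumed; ResidualBlock is an illustrative name):

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
            )

        def forward(self, x):
            # The skip connection adds the input back to the block's output,
            # so gradients can flow around the transformation unchanged.
            return x + self.body(x)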
Multi-Head Attention
An attention mechanism that runs multiple attention operations in parallel, capturing different types of relationships.
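A minimal sketch using PyTorch's built-in module (the library choice is an assumption), which runs the heads in parallel and concatenates their outputs internally:

    import torch
    import torch.nn as nn

    mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
    x = torch.randn(2, 10, 256)        # (batch, seq_len, embed_dim)
    out, weights = mha(x, x, x)        # self-attention; out is (2, 10, 256)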
Gradient Checkpointing
A technique that saves memory by recomputing intermediate activations during the backward pass instead of storing them all during the forward pass, trading extra computation for a smaller memory footprint.
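A minimal sketch with PyTorch's checkpoint utility (the library choice is an assumption):

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    layer = nn.Sequential(nn.Linear(512, 512), nn.ReLU())
    x = torch.randn(8, 512, requires_grad=True)

    # Activations inside `layer` are not stored during the forward pass;
    # they are recomputed when the backward pass reaches this segment.
    y = checkpoint(layer, x, use_reentrant=False)
    y.sum().backward()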
Sigmoid Function
An activation function that maps input values to a range between 0 and 1, useful for binary classification outputs.
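A minimal sketch of the formula sigma(x) = 1 / (1 + e^(-x)) alongside the built-in (PyTorch assumed):

    import torch

    def sigmoid(x):
        # Squashes any real input into the interval (0, 1).
        return 1.0 / (1.0 + torch.exp(-x))

    x = torch.tensor([-4.0, 0.0, 4.0])
    print(sigmoid(x))          # tensor([0.0180, 0.5000, 0.9820])
    print(torch.sigmoid(x))    # the built-in agrees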
Rotary Positional Encoding
A position encoding method that encodes absolute position with a rotation matrix and naturally incorporates relative position information into attention computations.
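A minimal sketch of the rotation (PyTorch assumed; the function name rotary_encode is illustrative, and the base of 10000 follows the common convention):

    import torch

    def rotary_encode(x, base=10000.0):
        # x: (seq_len, dim) with even dim. Each channel pair (2i, 2i+1) is
        # rotated by an angle proportional to its position, so attention
        # dot products end up depending on relative offsets between tokens.
        seq_len, dim = x.shape
        pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
        freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
        angles = pos * freqs                  # (seq_len, dim // 2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[:, 0::2], x[:, 1::2]
        out = torch.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin    # 2-D rotation of each pair
        out[:, 1::2] = x1 * sin + x2 * cos
        return out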
Weight Decay
A regularisation technique that penalises large model weights during training by adding a penalty proportional to the squared weight norm to the loss function, helping prevent overfitting.
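A minimal sketch of both common forms (PyTorch assumed; the coefficient 0.01 is arbitrary):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Linear(10, 1)

    # Decoupled form: the optimiser shrinks weights toward zero each step
    # (AdamW applies the decay separately from the gradient update).
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

    # L2 form: add a squared-norm penalty to the loss instead.
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = F.mse_loss(model(x), y)
    loss = loss + 0.01 * sum(p.pow(2).sum() for p in model.parameters())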
Exploding Gradient
A problem where gradients grow exponentially during backpropagation, causing unstable weight updates and training failure.
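A minimal sketch of the standard remedy, gradient clipping (PyTorch assumed):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)
    loss = model(torch.randn(32, 10)).pow(2).mean()
    loss.backward()

    # Cap the global gradient norm before the optimiser step so one large
    # gradient cannot produce an unstable weight update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)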