Overview
A mathematical function applied to neural network outputs to introduce non-linearity, enabling the learning of complex patterns.
Cross-References(1)
More in Deep Learning
Diffusion Model
Generative ModelsA generative model that learns to reverse a gradual noising process, generating high-quality samples from random noise.
Exploding Gradient
ArchitecturesA problem where gradients grow exponentially during backpropagation, causing unstable weight updates and training failure.
Fine-Tuning
Language ModelsThe process of adapting a pre-trained model to a specific task by continuing training on a smaller task-specific dataset, transferring learned representations to new domains.
Model Parallelism
ArchitecturesA distributed training approach that partitions a model across multiple devices, enabling training of models too large to fit in a single accelerator's memory.
State Space Model
ArchitecturesA sequence modelling architecture based on continuous-time dynamical systems that processes long sequences with linear complexity, offering an alternative to attention-based transformers.
Attention Head
Training & OptimisationAn individual attention computation within a multi-head attention layer that learns to focus on different aspects of the input, with outputs concatenated for richer representations.
LoRA
Language ModelsLow-Rank Adaptation — a parameter-efficient fine-tuning technique that adds trainable low-rank matrices to frozen pretrained weights.
Mixture of Experts
ArchitecturesAn architecture where different specialised sub-networks (experts) are selectively activated based on the input.