Overview
An architecture, commonly called a mixture of experts (MoE), where different specialised sub-networks (experts) are selectively activated for each input, typically by a learned gating network.
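The routing idea above can be sketched in a few lines. This is a toy, assumption-laden illustration (random linear experts, top-2 gating, no training), not any specific MoE implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy mixture-of-experts forward pass: a gating network scores each expert,
# the top-k experts run, and their outputs are combined with softmax weights.
# Shapes and the top-k choice are illustrative assumptions.
def moe_forward(x, expert_weights, gate_weights, k=2):
    scores = x @ gate_weights                    # one score per expert
    top = np.argsort(scores)[-k:]                # indices of the k best experts
    probs = np.exp(scores[top] - scores[top].max())
    probs /= probs.sum()                         # softmax over selected experts
    outputs = np.stack([x @ expert_weights[i] for i in top])
    return probs @ outputs                       # weighted combination

d, num_experts = 4, 3
experts = rng.normal(size=(num_experts, d, d))
gate = rng.normal(size=(d, num_experts))
y = moe_forward(rng.normal(size=d), experts, gate)
print(y.shape)  # (4,)
```

Only the selected experts do any work, which is what makes the approach attractive at scale: capacity grows with the number of experts while per-input compute stays roughly constant.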
More in Deep Learning
Self-Attention
An attention mechanism where each element in a sequence attends to all other elements to compute its representation.
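A minimal single-head sketch of the mechanism described above, using scaled dot-product attention; the projection matrices and sequence length are illustrative assumptions:

```python
import numpy as np

# Each of the 5 tokens attends to all 5 tokens: scores are scaled dot
# products of queries and keys, softmaxed row-wise, then applied to values.
def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (tokens, tokens)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                          # 5 tokens, dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```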
ReLU
Rectified Linear Unit — an activation function that outputs the input directly if positive, otherwise outputs zero.
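The definition above is exactly max(0, x) applied elementwise:

```python
import numpy as np

# ReLU: pass positive inputs through unchanged, clamp the rest to zero.
def relu(x):
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
```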
Knowledge Distillation
A model compression technique where a smaller student model learns to mimic the behaviour of a larger teacher model.
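One common way to express "mimic the teacher" is a KL divergence between temperature-softened output distributions. A sketch under that assumption (the temperature value and example logits are illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T               # temperature softens the distribution
    e = np.exp(z - z.max())
    return e / e.sum()

# Distillation loss sketch: KL divergence from the teacher's softened
# distribution to the student's. Minimising this pulls the student's
# outputs toward the teacher's.
def distillation_loss(student_logits, teacher_logits, T=2.0):
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))

loss = distillation_loss([1.0, 0.5, -1.0], [2.0, 1.0, -2.0])
```

In practice this term is usually mixed with the ordinary hard-label loss on the training data.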
Prefix Tuning
A parameter-efficient method that prepends trainable continuous vectors to the input of each transformer layer, guiding model behaviour without altering base parameters.
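The core operation is just concatenating a small trainable matrix in front of the frozen layer input. A single-layer sketch (shapes and the prefix length are illustrative assumptions; in a real setup only `prefix` would receive gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d, prefix_len = 6, 8, 4

X = rng.normal(size=(seq_len, d))                 # frozen model's token representations
prefix = rng.normal(size=(prefix_len, d)) * 0.01  # the only trainable tensor

# Prepend the prefix so subsequent attention can attend to it like extra tokens.
X_aug = np.concatenate([prefix, X], axis=0)
print(X_aug.shape)  # (10, 8)
```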
Fine-Tuning
The process of taking a pretrained model and further training it on a smaller, task-specific dataset.
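A toy numerical sketch of the idea: start from existing ("pretrained") weights and continue gradient descent on a small task dataset. Everything here is synthetic; a linear model stands in for the network:

```python
import numpy as np

rng = np.random.default_rng(0)
w_pretrained = rng.normal(size=3)       # stand-in for weights from pretraining

# Small task-specific dataset (synthetic).
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5])

# Fine-tuning: continue gradient descent from the pretrained weights.
w = w_pretrained.copy()
lr = 0.05
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(X)   # MSE gradient
    w -= lr * grad

initial_loss = float(np.mean((X @ w_pretrained - y) ** 2))
final_loss = float(np.mean((X @ w - y) ** 2))
```

The point of the warm start is that the optimiser begins near a useful solution, so far fewer task-specific examples and steps are needed than when training from scratch.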
Model Parallelism
A distributed training approach that partitions a model across multiple devices, enabling training of models too large to fit in a single accelerator's memory.
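One simple form of this is tensor parallelism: split a layer's weight matrix column-wise across devices and concatenate the partial outputs. A sketch where two arrays stand in for two devices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)
W = rng.normal(size=(8, 6))

# Each "device" holds half the columns of W and computes a partial output.
W_dev0, W_dev1 = W[:, :3], W[:, 3:]
y_parallel = np.concatenate([x @ W_dev0, x @ W_dev1])

# The concatenated partial results match the single-device computation.
y_single = x @ W
print(np.allclose(y_parallel, y_single))  # True
```

Real systems also split layers across devices (pipeline parallelism) and add communication steps between the shards; this sketch shows only the partitioning arithmetic.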
Pretraining
Training a model on a large general dataset before fine-tuning it on a specific downstream task.
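A toy sketch of the first half of that pipeline: fit on a large "general" dataset, then keep the resulting weights as the starting point for downstream adaptation. All data and the linear model are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -1.0, 0.5])

# Large general-purpose dataset (synthetic).
X_big = rng.normal(size=(1000, 3))
y_big = X_big @ true_w

# Pretraining phase: gradient descent from scratch on the general data.
w = np.zeros(3)
for _ in range(100):
    w -= 0.05 * 2 * X_big.T @ (X_big @ w - y_big) / len(X_big)

pretrained_w = w.copy()   # handed off as the initialisation for fine-tuning
```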
Adapter Layers
Small trainable modules inserted into otherwise frozen transformer layers that enable task-specific adaptation without modifying the original model weights.
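The standard shape for such a module is a bottleneck with a residual connection, sketched below; the dimensions and random frozen input are illustrative assumptions, and only the adapter weights would be trained:

```python
import numpy as np

rng = np.random.default_rng(0)
d, bottleneck = 8, 2

# Adapter parameters: down-project to a small bottleneck, then up-project.
# Small initial scale keeps the adapter close to an identity at the start.
W_down = rng.normal(size=(d, bottleneck)) * 0.1
W_up = rng.normal(size=(bottleneck, d)) * 0.1

def adapter(h):
    # Residual connection preserves the frozen model's behaviour; the
    # bottleneck path learns the task-specific correction.
    return h + np.maximum(0.0, h @ W_down) @ W_up

h = rng.normal(size=d)                # output of a frozen transformer sublayer
print(adapter(h).shape)  # (8,)
```

Because only the two small matrices are trainable, the per-task parameter count is a tiny fraction of the full model's.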