Overview
A technique that injects information about the position of tokens in a sequence into transformer architectures.
Cross-References(1)
More in Deep Learning
Mixture of Experts
ArchitecturesAn architecture where different specialised sub-networks (experts) are selectively activated based on the input.
Deep Learning
ArchitecturesA subset of machine learning using neural networks with multiple layers to learn hierarchical representations of data.
Gradient Checkpointing
ArchitecturesA memory optimisation that trades computation for memory by recomputing intermediate activations during the backward pass instead of storing them all during the forward pass.
Data Parallelism
ArchitecturesA distributed training strategy that replicates the model across multiple devices and divides training data into batches processed simultaneously, synchronising gradients after each step.
Pretraining
ArchitecturesTraining a model on a large general dataset before fine-tuning it on a specific downstream task.
Knowledge Distillation
ArchitecturesA model compression technique where a smaller student model learns to mimic the behaviour of a larger teacher model.
Generative Adversarial Network
Generative ModelsA framework where two neural networks compete — a generator creates synthetic data while a discriminator evaluates its authenticity.
Recurrent Neural Network
ArchitecturesA neural network architecture where connections between nodes form directed cycles, enabling processing of sequential data.