Overview
The initial phase of training a deep learning model on a large unlabelled corpus using self-supervised objectives, establishing general-purpose representations for downstream adaptation.
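A common self-supervised objective is masked-token prediction: corrupt part of the input and train the network to reconstruct the missing pieces from context. The sketch below illustrates one pre-training step with a toy PyTorch encoder; the vocabulary size, model dimensions, and random stand-in corpus are illustrative assumptions, not a reference implementation.

    import torch
    import torch.nn as nn

    VOCAB, DIM, MASK_ID = 1000, 64, 0   # toy sizes; real corpora are far larger

    class TinyEncoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, DIM)
            layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(DIM, VOCAB)   # predicts the original token id

        def forward(self, ids):
            return self.head(self.encoder(self.embed(ids)))

    model = TinyEncoder()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    tokens = torch.randint(1, VOCAB, (8, 32))      # stand-in for tokenised text
    mask = torch.rand(tokens.shape) < 0.15         # hide ~15% of positions
    corrupted = tokens.masked_fill(mask, MASK_ID)

    logits = model(corrupted)
    loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # masked slots only
    loss.backward()
    opt.step()

Repeated over a large unlabelled corpus, this loop constitutes the pre-training phase; the learned weights are then adapted to downstream tasks.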
More in Deep Learning
Convolutional Neural Network
Architectures: A deep learning architecture designed for processing structured grid data, such as images, using convolutional filters to detect features (sketched after this list).
Model Parallelism
Architectures: A distributed training approach that partitions a model across multiple devices, enabling training of models too large to fit in a single accelerator's memory (sketched after this list).
Graph Neural Network
Architectures: A neural network designed to operate on graph-structured data, learning representations of nodes, edges, and entire graphs (sketched after this list).
Capsule Network
Architectures: A neural network architecture that groups neurons into capsules to better capture spatial hierarchies and part-whole relationships.
Vanishing Gradient
Architectures: A problem in deep networks where gradients become extremely small during backpropagation, preventing earlier layers from learning (sketched after this list).
Data Parallelism
Architectures: A distributed training strategy that replicates the model across multiple devices, splitting each training batch into shards processed simultaneously and synchronising gradients after each step (sketched after this list).
Residual Connection
Training & Optimisation: A skip connection that adds a layer's input directly to its output, enabling gradient flow through deep networks and allowing training of architectures with hundreds of layers (sketched after this list).
Pipeline Parallelism
Architectures: A form of model parallelism that splits neural network layers across devices and pipelines micro-batches through the stages, maximising hardware utilisation during training (sketched after this list).
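For the Convolutional Neural Network entry above, a convolution layer slides small learned filters over the input grid, each producing a feature map. A minimal PyTorch sketch, with illustrative channel counts and image size:

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
    img = torch.randn(1, 3, 32, 32)        # one 32x32 RGB image, random stand-in
    features = torch.relu(conv(img))       # 8 feature maps, each still 32x32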
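For Model Parallelism, the defining move is placing different parts of one model on different devices and shipping activations across the boundary. A minimal sketch assuming two devices; it falls back to CPU so it stays runnable:

    import torch
    import torch.nn as nn

    dev0 = torch.device("cuda:0" if torch.cuda.device_count() >= 1 else "cpu")
    dev1 = torch.device("cuda:1" if torch.cuda.device_count() >= 2 else "cpu")

    stage1 = nn.Linear(512, 512).to(dev0)   # first half of the model
    stage2 = nn.Linear(512, 10).to(dev1)    # second half on another device

    x = torch.randn(4, 512, device=dev0)
    out = stage2(stage1(x).to(dev1))        # activations cross the device boundary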
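For the Graph Neural Network entry, a layer typically updates each node by aggregating its neighbours' features (message passing). A mean-aggregation sketch over a random, purely illustrative graph:

    import torch
    import torch.nn as nn

    class GraphLayer(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.lin = nn.Linear(2 * dim, dim)

        def forward(self, x, adj):
            deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
            neigh = adj @ x / deg                       # mean of neighbour features
            return torch.relu(self.lin(torch.cat([x, neigh], dim=-1)))

    x = torch.randn(5, 16)                     # 5 nodes, 16 features each
    adj = (torch.rand(5, 5) < 0.4).float()     # random adjacency matrix
    out = GraphLayer(16)(x, adj)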
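The Vanishing Gradient entry can be made concrete: stacking many saturating layers shrinks gradients multiplicatively, so the earliest layers barely learn. A small demonstration with 20 sigmoid layers (sizes are illustrative):

    import torch
    import torch.nn as nn

    net = nn.Sequential(*[nn.Sequential(nn.Linear(64, 64), nn.Sigmoid())
                          for _ in range(20)])
    net(torch.randn(8, 64)).sum().backward()

    first = net[0][0].weight.grad.abs().mean()   # earliest layer
    last = net[-1][0].weight.grad.abs().mean()   # layer nearest the loss
    print(f"first layer {first:.2e} vs last layer {last:.2e}")  # first is far smaller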
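For Data Parallelism, each replica computes gradients on its shard of the batch, and the gradients are then averaged so every replica applies the identical update. A two-replica sketch of that synchronisation step; real systems such as PyTorch's DistributedDataParallel perform it with an all-reduce across processes:

    import copy
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Linear(10, 1)
    replica = copy.deepcopy(model)                    # stand-in for a second device

    xa, xb = torch.randn(4, 10), torch.randn(4, 10)   # two shards of one batch
    ya, yb = torch.randn(4, 1), torch.randn(4, 1)

    F.mse_loss(model(xa), ya).backward()
    F.mse_loss(replica(xb), yb).backward()

    for p, q in zip(model.parameters(), replica.parameters()):
        avg = (p.grad + q.grad) / 2                   # the "all-reduce" step
        p.grad.copy_(avg)
        q.grad.copy_(avg)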
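The Residual Connection entry reduces to one line of arithmetic: the block's output is its input plus a learned correction, so the identity path carries gradients straight through. A minimal sketch:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, dim))

        def forward(self, x):
            return x + self.body(x)     # skip path: input added to output

    out = ResidualBlock(32)(torch.randn(8, 32))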
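Finally, for Pipeline Parallelism, the batch is cut into micro-batches that stream through model-parallel stages so that later stages stay busy. A naive sketch of the micro-batching; real schedulers such as GPipe interleave forward and backward passes far more aggressively:

    import torch
    import torch.nn as nn

    dev0 = torch.device("cuda:0" if torch.cuda.device_count() >= 1 else "cpu")
    dev1 = torch.device("cuda:1" if torch.cuda.device_count() >= 2 else "cpu")
    stage1 = nn.Linear(512, 512).to(dev0)
    stage2 = nn.Linear(512, 10).to(dev1)

    batch = torch.randn(16, 512, device=dev0)
    outputs = []
    for mb in batch.chunk(4):                 # split the batch into 4 micro-batches
        h = stage1(mb).to(dev1)               # stage 1, then send activations on
        outputs.append(stage2(h))             # on real GPUs kernel launches are
                                              # asynchronous, so stage 1 can begin
                                              # the next micro-batch meanwhile
    out = torch.cat(outputs)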