Fine-Tuning — Technology Wiki

Overview

Direct Answer

Fine-tuning is the process of taking a pre-trained neural network and continuing to train its weights on a smaller, task-specific dataset, adapting its learned representations to a new domain or objective. This approach reuses the features the model has already learned whilst specialising it for a particular downstream task.

How It Works

The process begins with a model already trained on large-scale data, which has developed generalised feature detectors across its layers. Training then resumes on the task-specific dataset, typically with a reduced learning rate so that the learned representations are preserved and the weights shift only slightly. Some layers, usually the earlier ones, may be frozen to keep their feature extractors intact, whilst deeper layers and the output layer are trained more aggressively.
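The freezing and reduced-learning-rate ideas can be sketched in a few lines of plain Python. This is a toy illustration, not a real training loop: the group names (`early_layers`, `late_layers`, `output_head`), the weights, the gradients, and the learning rates are all made-up assumptions.

```python
# Toy sketch of selective fine-tuning; every name and number here is illustrative.

def sgd_step(params, grads, lrs, frozen):
    """Return updated parameters: frozen groups are copied unchanged,
    every other group takes one SGD step with its own learning rate."""
    updated = {}
    for name, values in params.items():
        if name in frozen:
            updated[name] = list(values)
        else:
            updated[name] = [w - lrs[name] * g for w, g in zip(values, grads[name])]
    return updated

# "Pre-trained" weights for three parameter groups (made-up values).
params = {
    "early_layers": [0.5, -0.3],
    "late_layers": [0.8, 0.1],
    "output_head": [0.0, 0.0],
}
# Gradients from one batch of task-specific data (made-up values).
grads = {
    "early_layers": [1.0, 1.0],
    "late_layers": [1.0, -1.0],
    "output_head": [2.0, -2.0],
}
frozen = {"early_layers"}                         # feature extractors kept as-is
lrs = {"late_layers": 1e-4, "output_head": 1e-2}  # small rate for pre-trained layers

new_params = sgd_step(params, grads, lrs, frozen)
```

After the step, the frozen group is unchanged, the late layers have moved only slightly, and the freshly initialised head has moved the most. Deep learning frameworks usually express the same idea through per-group optimiser settings and a flag that disables gradients for frozen weights.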

Why It Matters

Fine-tuning typically needs far less training time and data than training from scratch, which lowers computational cost and enables rapid deployment in resource-constrained settings. On specialised tasks where collecting a large labelled dataset is prohibitively expensive, it often reaches higher accuracy than a model trained from scratch on the same data, making advanced AI accessible to organisations without massive data resources.

Common Applications

Practical applications include adapting large language models to domain-specific language (legal contracts, medical notes), customising vision models for medical imaging or manufacturing defect detection, personalising recommendation systems, and tuning financial fraud detection systems.

Key Considerations

Practitioners must choose the learning rate carefully: too high a rate can cause catastrophic forgetting, where the model loses previously learned features, whilst a small task dataset makes overfitting a risk. Dataset quality and representativeness are critical, and the choice of which layers to freeze trades computational efficiency against task performance.
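One common guard against overfitting on a small dataset is early stopping on a held-out validation set. The sketch below is an assumption about how that might look, not a method prescribed here; the function name, the `patience` parameter, and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights would be kept: the best
    validation loss seen before `patience` consecutive epochs without improvement."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has stopped improving
    return best_epoch

# Made-up validation losses, one per fine-tuning epoch.
val_losses = [0.90, 0.70, 0.62, 0.65, 0.71, 0.60]
best = early_stopping_epoch(val_losses, patience=2)
```

With `patience=2`, training stops after two non-improving epochs and the checkpoint from epoch 2 is kept. A larger patience tolerates more noise in the validation curve at the cost of extra training time.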

Related in Language Models

Word Embedding

Dense vector representations of words where semantically similar words are mapped to nearby points in vector space.
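"Nearby" is often measured with cosine similarity. The sketch below uses hand-written three-dimensional vectors purely for illustration; real embeddings are learned from data and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
```

With these toy vectors, "cat" scores much closer to "dog" than to "car".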

LoRA

Low-Rank Adaptation — a parameter-efficient fine-tuning technique that adds trainable low-rank matrices to frozen pretrained weights.
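A minimal sketch of the LoRA forward pass, assuming the usual formulation in which the output is W x + (alpha / r) B A x, with W frozen and only the small matrices A and B trained. The dimensions and values below are toy assumptions.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, A and B are trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

d_in, d_out, r, alpha = 4, 3, 1, 2.0
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]        # frozen pre-trained weights (toy values)
A = [[0.1, 0.2, 0.3, 0.4]]        # r x d_in, trainable
B_zero = [[0.0], [0.0], [0.0]]    # d_out x r, zero at the start of training
x = [1.0, 2.0, 3.0, 4.0]

y_start = lora_forward(W, A, B_zero, x, alpha, r)    # identical to W x

B_trained = [[1.0], [0.0], [-1.0]]                   # pretend values after training
y_adapted = lora_forward(W, A, B_trained, x, alpha, r)

full_params = d_out * d_in        # weights updated by full fine-tuning
lora_params = r * (d_in + d_out)  # weights updated by LoRA
```

Because B starts at zero, the adapted model initially behaves exactly like the pre-trained one. The saving in trainable parameters is modest at this toy size (7 versus 12) and grows quickly with layer width, since the LoRA count scales with d_in + d_out rather than d_in × d_out.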

Parameter-Efficient Fine-Tuning

Methods for adapting large pretrained models to new tasks by only updating a small fraction of their parameters.

Adapter Layers

Small trainable modules inserted within or between the layers of a frozen transformer, enabling task-specific adaptation without modifying the original model weights.

Prefix Tuning

A parameter-efficient method that prepends trainable continuous vectors to the input of each transformer layer, guiding model behaviour without altering base parameters.

Pre-Training

The initial phase of training a deep learning model on a large unlabelled corpus using self-supervised objectives, establishing general-purpose representations for downstream adaptation.

More in Deep Learning

Positional Encoding

Training & Optimisation

A technique that injects information about the position of tokens in a sequence into transformer architectures.
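One well-known scheme is the fixed sinusoidal encoding, sketched below; learned position embeddings are another option. The function name and the choice of 10000 as the base follow the common convention and are assumptions here.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Return a d_model-length vector for one position: each even index i holds
    sin(position / 10000**(i / d_model)) and index i + 1 holds the matching cosine."""
    enc = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))
        if i + 1 < d_model:
            enc.append(math.cos(angle))
    return enc

# Encodings for the first four positions of a sequence, with a toy width of 8.
pe = [sinusoidal_encoding(pos, 8) for pos in range(4)]
```

Each vector is added to the token embedding at the same position, so otherwise identical tokens become distinguishable by where they appear.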

Neural Network

Architectures

A computing system inspired by biological neural networks, consisting of interconnected nodes that process information in layers.

Dropout

Training & Optimisation

A regularisation technique that randomly deactivates neurons during training to prevent co-adaptation and reduce overfitting.
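A minimal sketch of the widely used "inverted" variant, which rescales the surviving activations during training so that nothing needs to change at inference time. The function signature is an assumption for illustration.

```python
import random

def dropout(activations, p, training=True, rng=random):
    """Inverted dropout: during training, zero each activation with probability p
    and scale survivors by 1 / (1 - p); at inference, return the input unchanged."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

# Seeded generator so the random mask is reproducible.
masked = dropout([1.0] * 8, p=0.5, rng=random.Random(0))
unchanged = dropout([1.0] * 8, p=0.5, training=False)
```

With `p=0.5`, each entry of `masked` is either 0.0 or 2.0, so the expected value of every activation stays at 1.0.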

Deep Learning

Architectures

A subset of machine learning using neural networks with multiple layers to learn hierarchical representations of data.

Gated Recurrent Unit

Architectures

A simplified variant of LSTM that combines the forget and input gates into a single update gate.
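The gating can be sketched with a scalar GRU step. The parameter names in the dictionary `p` are hypothetical, and references differ on whether the update gate z weights the candidate state or the previous state; this sketch uses the former.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h_prev, p):
    """One GRU step for scalar input x and scalar hidden state h_prev."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])  # candidate state
    # The single update gate decides both how much old state to keep
    # and how much of the candidate to write.
    return (1.0 - z) * h_prev + z * h_cand

# All-zero parameters make both gates 0.5 and the candidate state 0.
zero_params = {k: 0.0 for k in ("wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh")}
h_next = gru_step(x=1.0, h_prev=0.8, p=zero_params)
```

An LSTM would use separate forget and input gates for the two roles played here by `1 - z` and `z`.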

Rotary Positional Encoding

Training & Optimisation

A position encoding method that encodes absolute position with a rotation matrix and naturally incorporates relative position information into attention computations.
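The relative-position property can be checked numerically on a single two-dimensional pair. This is a simplified sketch: it assumes one rotation frequency `theta`, whereas a full implementation rotates many dimension pairs, each with its own frequency. The vectors are made up.

```python
import math

def rotate(vec, position, theta=1.0):
    """Rotate a 2-D vector by the angle position * theta."""
    x, y = vec
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 2.0), (0.5, -1.0)  # toy query and key vectors

score_a = dot(rotate(q, 3), rotate(k, 1))    # positions 3 and 1, offset 2
score_b = dot(rotate(q, 10), rotate(k, 8))   # positions 10 and 8, offset 2
score_c = dot(rotate(q, 6), rotate(k, 1))    # positions 6 and 1, offset 5
```

`score_a` and `score_b` agree up to floating-point error because the attention score depends only on the offset between the two positions, while `score_c` differs because the offset has changed.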

Data Parallelism

Architectures

A distributed training strategy that replicates the model across multiple devices and splits each batch of training data into shards that are processed simultaneously, synchronising gradients after each step so that every replica applies the same update.
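The gradient synchronisation can be simulated on one machine. The sketch assumes a one-parameter linear model with mean-squared-error loss, equal-sized shards, and a plain average standing in for the all-reduce; the data values are made up.

```python
def mse_grad(w, batch):
    """Gradient of mean((w * x - y)**2) with respect to w over (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_grad(w, batch, num_devices):
    """Split the batch into equal shards, compute one gradient per simulated
    device, then average the results (the role played by an all-reduce)."""
    if len(batch) % num_devices != 0:
        raise ValueError("this sketch assumes equal-sized shards")
    shard_size = len(batch) // num_devices
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_devices)]
    local_grads = [mse_grad(w, shard) for shard in shards]
    return sum(local_grads) / num_devices

batch = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0)]
w = 1.5

g_full = mse_grad(w, batch)                                # single-device gradient
g_parallel = data_parallel_grad(w, batch, num_devices=2)   # averaged shard gradients
```

With equal shard sizes the averaged gradient matches the full-batch gradient, which is why every replica can apply the same update and stay in sync.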

ReLU

Training & Optimisation

Rectified Linear Unit — an activation function that outputs the input directly if positive, otherwise outputs zero.
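As a one-line sketch in Python:

```python
def relu(x):
    """Return x if it is positive, otherwise 0."""
    return max(0.0, x)

activations = [relu(v) for v in [-2.0, -0.5, 0.0, 0.5, 2.0]]
# activations == [0.0, 0.0, 0.0, 0.5, 2.0]
```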