Overview
Small trainable modules inserted between the frozen layers of a transformer, enabling task-specific adaptation without modifying the original model weights.
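The most common design is a bottleneck: a down-projection, a nonlinearity, and an up-projection, wrapped in a residual connection so the adapter can start close to the identity function. Below is a minimal PyTorch sketch of that pattern; the class name, dimensions, and near-zero initialisation are illustrative assumptions rather than a reference implementation.

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""

        def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
            super().__init__()
            self.down = nn.Linear(hidden_dim, bottleneck_dim)
            self.act = nn.GELU()
            self.up = nn.Linear(bottleneck_dim, hidden_dim)
            # Near-zero init of the up-projection (an assumption, not universal)
            # so the adapter initially behaves as an identity function.
            nn.init.zeros_(self.up.weight)
            nn.init.zeros_(self.up.bias)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Residual connection: the frozen layer's output passes through
            # unchanged, plus a small learned correction.
            return x + self.up(self.act(self.down(x)))

    hidden = torch.randn(2, 16, 768)        # (batch, seq_len, hidden_dim)
    out = Adapter(hidden_dim=768)(hidden)   # same shape as the input

During fine-tuning, the surrounding transformer weights are frozen (requires_grad_(False)) and only the adapter parameters receive gradient updates, which keeps the trainable parameter count small.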
More in Deep Learning
Tensor Parallelism
Architectures: A distributed computing strategy that splits individual layer computations across multiple devices by partitioning weight matrices along specific dimensions.

Knowledge Distillation
Architectures: A model compression technique where a smaller student model learns to mimic the behaviour of a larger teacher model.

Gated Recurrent Unit
Architectures: A simplified variant of LSTM that combines the forget and input gates into a single update gate.

Softmax Function
Training & Optimisation: An activation function that converts a vector of numbers into a probability distribution, commonly used in multi-class classification.

Rotary Positional Encoding
Training & Optimisation: A position encoding method that encodes absolute position with a rotation matrix and naturally incorporates relative position information into attention computations.

Contrastive Learning
Architectures: A self-supervised learning approach that trains models by comparing similar and dissimilar pairs of data representations.

Convolutional Layer
Architectures: A neural network layer that applies learnable filters across input data to detect local patterns and features.

Key-Value Cache
Architectures: An optimisation in autoregressive transformer inference that stores previously computed key and value tensors to avoid redundant computation during sequential token generation (see the sketch after this list).
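To make the Key-Value Cache entry above concrete, here is a minimal sketch of the caching pattern for a single attention head during greedy decoding; the query/key/value projection matrices are omitted, and all names and shapes are illustrative assumptions.

    import torch

    def attend(q, k, v):
        # q: (1, d); k, v: (t, d). The new token attends over all
        # t cached positions with scaled dot-product scores.
        scores = q @ k.T / k.shape[-1] ** 0.5
        return torch.softmax(scores, dim=-1) @ v

    d = 64
    k_cache = torch.empty(0, d)  # grows by one row per generated token
    v_cache = torch.empty(0, d)

    for step in range(5):
        x = torch.randn(1, d)  # hidden state of the newest token
        # Only the new token's key and value are computed and appended;
        # earlier entries are reused instead of being recomputed each step.
        # (x stands in for x @ W_k and x @ W_v here.)
        k_cache = torch.cat([k_cache, x], dim=0)
        v_cache = torch.cat([v_cache, x], dim=0)
        out = attend(x, k_cache, v_cache)  # shape (1, d)

Without the cache, every decoding step would rerun the key and value projections over the entire prefix, so the cache trades memory for the elimination of that redundant computation.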