Deep Learning › Architectures

Key-Value Cache

Overview

An optimisation for autoregressive transformer inference that caches the key and value tensors computed for earlier tokens, so each newly generated token attends against the cache instead of recomputing projections for the entire prefix.
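A minimal NumPy sketch of the idea (all names and the single-head setup are illustrative, not any particular library's API): each generation step projects only the newest token, appends its key and value to the cache, and attends over everything cached so far.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # head dimension (toy size)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def generate_step(x_t, cache):
    """Attention output for the newest token only, reusing cached K/V."""
    cache["k"].append(x_t @ W_k)        # project just the new token...
    cache["v"].append(x_t @ W_v)        # ...never re-projecting the prefix
    K = np.stack(cache["k"])            # (t, d): all keys so far
    V = np.stack(cache["v"])            # (t, d): all values so far
    q_t = x_t @ W_q
    return softmax(q_t @ K.T / np.sqrt(d)) @ V

cache = {"k": [], "v": []}
out = None
for t in range(5):                      # one forward step per generated token
    out = generate_step(rng.standard_normal(d), cache)
```

Without the cache, step *t* would redo the K/V projections for all *t* previous tokens, giving quadratic total work in sequence length; with it, each step does a constant amount of projection work.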

Cross-References (2)

Deep Learning
Blockchain & DLT

More in Deep Learning

Data Parallelism

Architectures

A distributed training strategy that replicates the model across multiple devices and divides training data into batches processed simultaneously, synchronising gradients after each step.
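A toy single-process sketch of the scheme (the two "devices" are simulated, and the linear-regression setup is purely illustrative): each replica computes a gradient on its shard of the batch, the gradients are averaged — the role an all-reduce plays in real systems — and every replica applies the identical update.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                             # weights, replicated on every device
X = rng.standard_normal((8, 3))             # global batch
y = X @ np.array([1.0, -2.0, 0.5])          # synthetic linear targets

shards = np.array_split(np.arange(8), 2)    # split the batch across 2 "devices"

def local_grad(w, idx):
    # each replica computes the MSE gradient on its own shard
    Xi, yi = X[idx], y[idx]
    return 2 * Xi.T @ (Xi @ w - yi) / len(idx)

for _ in range(300):
    grads = [local_grad(w, idx) for idx in shards]  # runs in parallel in practice
    g = np.mean(grads, axis=0)                      # "all-reduce": average gradients
    w -= 0.05 * g                                   # identical step on every replica
```

With equal shard sizes, the averaged gradient equals the full-batch gradient exactly, which is why the replicated models stay in sync.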

Self-Attention

Architectures

An attention mechanism in which every element of a sequence attends to every element of that same sequence, itself included, to compute its representation.
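The computation reduces to a few matrix products — a minimal NumPy sketch (single head, no masking; dimensions are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                               # sequence length, model dimension
X = rng.standard_normal((n, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v       # each token yields a query, key, value
scores = Q @ K.T / np.sqrt(d)             # (n, n): every token scores every token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
out = weights @ V                         # each row: weighted mix of all values
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the value vectors of the whole sequence.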

Activation Function

Training & Optimisation

A mathematical function applied to neural network outputs to introduce non-linearity, enabling the learning of complex patterns.
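Two common choices, sketched in NumPy: without such a function between them, stacked linear layers collapse into a single linear map, so the kink in ReLU (or the curve in sigmoid) is what lets depth add expressive power.

```python
import numpy as np

def relu(x):
    # max(0, x): cheap to compute; the kink at zero breaks linearity
    return np.maximum(0.0, x)

def sigmoid(x):
    # squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 3.0])
activated = relu(x)       # negatives zeroed, positives passed through
squashed = sigmoid(x)     # all values mapped into (0, 1)
```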

Vanishing Gradient

Training & Optimisation

A problem in deep networks where gradients become extremely small during backpropagation, preventing earlier layers from learning.
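The effect is easy to reproduce: the sigmoid's derivative, s(1 − s), peaks at 0.25, and backpropagation multiplies one such factor per layer, so the gradient reaching early layers shrinks geometrically. A minimal sketch with the pre-activations held at zero (the best case for sigmoid):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.0          # pre-activation at every layer (where sigmoid' is largest)
grad = 1.0       # gradient arriving from the loss
for layer in range(30):          # backprop through 30 sigmoid layers
    s = sigmoid(x)
    grad *= s * (1 - s)          # chain rule: multiply the local derivative

# grad is now 0.25**30, roughly 8.7e-19: effectively zero for fp32 updates
```

Residual connections and ReLU-family activations are the standard mitigations, since both keep a derivative path close to 1.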

Prefix Tuning

Language Models

A parameter-efficient method that prepends trainable continuous vectors to the input of each transformer layer, guiding model behaviour without altering base parameters.
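A rough single-layer sketch of the mechanism, in NumPy (names and the key/value-prepending formulation are illustrative; real implementations apply this at every layer and train only the prefix): frozen projections compute Q, K, V as usual, and trainable prefix vectors are prepended to the keys and values so that every token can attend to them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 4, 8, 2                        # sequence length, dim, prefix length
X = rng.standard_normal((n, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))  # frozen

prefix_k = rng.standard_normal((p, d)) * 0.1   # trainable continuous vectors
prefix_v = rng.standard_normal((p, d)) * 0.1   # (the only tuned parameters)

Q = X @ W_q
K = np.concatenate([prefix_k, X @ W_k])  # prefixes join the keys...
V = np.concatenate([prefix_v, X @ W_v])  # ...and the values

scores = Q @ K.T / np.sqrt(d)            # (n, n + p)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                        # outputs now steered by the prefix
```

Only the `p × d` prefix matrices per layer are updated during tuning, which is what makes the method parameter-efficient.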

Residual Connection

Training & Optimisation

A skip connection that adds a layer's input directly to its output, enabling gradient flow through deep networks and allowing training of architectures with hundreds of layers.
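The definition in one line of code: the output is `x + f(x)`, so even when the layer's contribution `f(x)` is tiny, the input (and, in backpropagation, the gradient) passes through unchanged. A minimal sketch with a near-zero layer:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d)) * 0.001  # a layer initialised near zero

def layer(x):
    return np.tanh(x @ W)                # the residual branch f(x)

x = rng.standard_normal(d)
y = x + layer(x)                         # skip connection: input + f(input)

# y stays close to x: the identity path carries the signal even when
# the layer contributes almost nothing, so gradients never vanish here
```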

Mixed Precision Training

Training & Optimisation

Training neural networks using both 16-bit and 32-bit floating-point arithmetic to speed up computation while maintaining accuracy.
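One reason a 32-bit master copy of the weights is kept can be shown with scalar arithmetic (a toy sketch, not any framework's API): near 1.0, fp16 can only resolve steps of about 1e-3, so a small update applied in fp16 rounds away entirely, while the same update applied to an fp32 master copy survives.

```python
import numpy as np

lr = np.float32(0.001)
grad = np.float16(0.0004)            # a small gradient, as if from an fp16 pass

# naive all-fp16 update: the step (~4e-7) is far below fp16 resolution at 1.0
naive_w = np.float16(1.0)
for _ in range(3):
    naive_w = naive_w - np.float16(lr) * grad    # rounds back to 1.0 each time

# mixed precision: keep an fp32 master copy and apply the update there
master_w = np.float32(1.0)
for _ in range(3):
    w16 = np.float16(master_w)       # cast-down copy used for the fast fp16 pass
    master_w = master_w - lr * np.float32(grad)  # fp32 accumulates the tiny step
```

In practice, loss scaling is also used so that small gradients do not underflow fp16 before they ever reach the optimiser.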

Attention Head

Architectures

An individual attention computation within a multi-head attention layer that learns to focus on different aspects of the input, with outputs concatenated for richer representations.
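A minimal NumPy sketch of the split-and-concatenate structure (toy dimensions, independent random projections per head): each head runs scaled dot-product attention in its own `d/h`-dimensional subspace, and the head outputs are concatenated back to the model dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 4, 8, 2                      # sequence length, model dim, head count
d_h = d // h                           # each head works in a d/h-dim subspace
X = rng.standard_normal((n, d))

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

head_outputs = []
for _ in range(h):
    # each head owns its projections, so it can attend to different patterns
    W_q, W_k, W_v = (rng.standard_normal((d, d_h)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    head_outputs.append(softmax(Q @ K.T / np.sqrt(d_h)) @ V)   # (n, d_h)

out = np.concatenate(head_outputs, axis=-1)   # (n, d): heads rejoined
```

A final linear projection usually mixes the concatenated head outputs; it is omitted here for brevity.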
