Overview
Direct Answer
An inference optimisation in transformer-based models that caches previously computed key and value tensors during autoregressive generation, eliminating redundant recalculation as each new token is produced. It trades additional memory for a significant reduction in computational overhead during inference, without altering model outputs.
How It Works
During token generation, the transformer computes the query, key, and value projections for the current token only, appends the new key and value to the cache, and attends over the cached key-value pairs from prior positions rather than recomputing them. The cache is extended by one position as each new token is generated, so the attention cost of each step grows linearly with the current sequence length rather than quadratically, as it would if the whole prefix were reprocessed. Modern implementations store these tensors in GPU memory or system RAM, depending on batch size and model dimensions.
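The mechanism can be illustrated with a minimal single-head sketch in plain Python. It is not taken from any particular framework: the names `KVCache`, `step_with_cache`, `step_without_cache`, and `run_demo` are hypothetical, the weights are random, and real implementations work on batched multi-head tensors on an accelerator. The sketch only shows that attending over cached keys and values gives the same result as re-projecting the whole prefix at every step.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

class KVCache:
    """Hypothetical cache: grows by one key and one value per token."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def step_with_cache(x, Wq, Wk, Wv, cache):
    """Project only the newest token, then attend over the cached entries."""
    cache.append(matvec(Wk, x), matvec(Wv, x))
    return attend(matvec(Wq, x), cache.keys, cache.values)

def step_without_cache(xs, Wq, Wk, Wv):
    """Baseline: re-project every token seen so far at each step."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def run_demo(num_tokens=6, dim=4, seed=0):
    """Return the largest output difference and the final cache length."""
    rng = random.Random(seed)
    Wq, Wk, Wv = (random_matrix(rng, dim, dim) for _ in range(3))
    xs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_tokens)]
    cache = KVCache()
    max_diff = 0.0
    for t in range(num_tokens):
        cached = step_with_cache(xs[t], Wq, Wk, Wv, cache)
        full = step_without_cache(xs[: t + 1], Wq, Wk, Wv)
        max_diff = max(max_diff,
                       max(abs(a - b) for a, b in zip(cached, full)))
    return max_diff, len(cache.keys)
```

Calling `run_demo()` returns the maximum difference between the cached and uncached outputs, which should be within floating-point tolerance of zero, together with the cache length, which equals the number of tokens generated. The baseline performs a number of key and value projections that grows with every step, whereas the cached path performs exactly one of each per step.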
Why It Matters
Key-value caching can reduce inference latency by a factor of 2–3 on typical sequence lengths, with the benefit growing as sequences lengthen. This directly lowers operational costs for production language models and enables real-time interactive applications. For resource-constrained environments and large-scale deployments, this optimisation often determines whether transformer inference is practically feasible.
Common Applications
Used extensively in conversational AI systems, real-time code generation tools, and streaming text summarisation services. Dialogue systems relying on multi-turn interactions particularly benefit from avoiding reprocessing of prior conversation history.
Key Considerations
Cache memory consumption scales linearly with batch size and sequence length, as well as with the number of layers and the key-value dimension of each layer, creating practical limits on concurrency and maximum context window. Careful management is required to prevent memory exhaustion, and cache eviction and invalidation strategies vary across frameworks and hardware configurations.
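The linear scaling can be made concrete with a back-of-the-envelope estimate. The function below is a sketch under simple assumptions: one key tensor and one value tensor per layer, no paging or quantisation overhead, and a hypothetical name and parameter list chosen for illustration rather than taken from any library.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_element=2):
    """Estimate KV cache size in bytes (default 2 bytes per element, e.g. fp16)."""
    # Factor 2: every layer stores one tensor for keys and one for values.
    return (2 * batch_size * seq_len * num_layers
            * num_kv_heads * head_dim * bytes_per_element)

# Hypothetical configuration: 32 layers, 32 key-value heads of width 128.
per_token = kv_cache_bytes(1, 1, 32, 32, 128)      # 524288 bytes (0.5 MiB)
one_request = kv_cache_bytes(1, 4096, 32, 32, 128)  # 2147483648 bytes (2 GiB)
```

Under these assumed dimensions a single 4,096-token sequence occupies 2 GiB, and doubling either the batch size or the sequence length doubles that figure, which is why concurrency and context length compete for the same memory budget.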
More in Deep Learning
Residual Network
Training & Optimisation: A deep neural network architecture using skip connections that allow gradients to flow directly through layers, enabling very deep networks.
Pretraining
Architectures: Training a model on a large general dataset before fine-tuning it on a specific downstream task.
Dropout
Training & Optimisation: A regularisation technique that randomly deactivates neurons during training to prevent co-adaptation and reduce overfitting.
Adapter Layers
Language Models: Small trainable modules inserted between frozen transformer layers that enable task-specific adaptation without modifying the original model weights.
Vision Transformer
Architectures: A transformer architecture adapted for image recognition that divides images into patches and processes them as sequences, rivalling convolutional networks in visual tasks.
Positional Encoding
Training & Optimisation: A technique that injects information about the position of tokens in a sequence into transformer architectures.
Representation Learning
Architectures: The automatic discovery of data representations needed for feature detection or classification from raw data.
Vanishing Gradient
Architectures: A problem in deep networks where gradients become extremely small during backpropagation, preventing earlier layers from learning.