Mini-Batch — Technology Wiki

Overview

Direct Answer

A mini-batch is a small, fixed-size subset of training data used to compute a single gradient update during iterative optimisation. It represents a practical compromise between processing individual samples (stochastic gradient descent) and the entire dataset (batch gradient descent).

How It Works

During each training iteration, a mini-batch of typically 32 to 512 samples is drawn from the training dataset, usually by shuffling the data once per epoch and taking consecutive slices. The model computes predictions for all samples in the mini-batch, averages the loss across them, and backpropagates to produce a single gradient estimate. This gradient is used to update the model weights before the next mini-batch is processed.
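The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration on linear regression with a mean-squared-error loss, not a reference implementation; the function name `minibatch_gd`, its default arguments, and the synthetic data are assumptions made for the example.

```python
import numpy as np


def minibatch_gd(X, y, batch_size=32, lr=0.1, epochs=50, seed=0):
    """Fit linear-regression weights with mini-batch gradient descent.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)

    for _ in range(epochs):
        # Reshuffle once per epoch so every sample is visited exactly once.
        order = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]  # the last batch may be smaller
            X_b, y_b = X[idx], y[idx]

            # Forward pass: prediction error for every sample in the mini-batch.
            residual = X_b @ w - y_b

            # Gradient of the mean squared error, averaged over the mini-batch.
            grad = 2.0 * X_b.T @ residual / len(idx)

            # One weight update per mini-batch.
            w -= lr * grad
    return w


# Synthetic data (assumed for the example): 1000 samples, 3 features.
data_rng = np.random.default_rng(42)
X_demo = data_rng.normal(size=(1000, 3))
w_demo = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ w_demo + 0.01 * data_rng.normal(size=1000)

print(minibatch_gd(X_demo, y_demo))  # should land close to w_demo
```

In a deep learning framework the forward pass and gradient would come from the model and automatic differentiation, but the shuffle, slice, update structure is the same.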

Why It Matters

Mini-batches enable efficient hardware utilisation by vectorising computations across multiple samples simultaneously, which substantially reduces training time on GPUs and TPUs. They also give lower-variance gradient estimates than single-sample updates, which tends to smooth convergence and often improves final model accuracy, whilst keeping the cost of each update manageable on datasets too large to process in one pass.
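The stability claim can be checked empirically. The sketch below, under assumed synthetic data and a hypothetical helper named `gradient_spread`, draws many random mini-batches at a fixed weight vector and measures how much the resulting gradient estimates vary. Each gradient is computed with a single vectorised matrix product over the whole mini-batch.

```python
import numpy as np


def gradient_spread(X, y, w, batch_size, n_draws=500, seed=0):
    """Mean per-coordinate standard deviation of mini-batch MSE gradients at w.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    grads = np.empty((n_draws, X.shape[1]))
    for i in range(n_draws):
        # Sample one mini-batch without replacement.
        idx = rng.choice(X.shape[0], size=batch_size, replace=False)
        residual = X[idx] @ w - y[idx]
        # Vectorised gradient over all samples in the mini-batch.
        grads[i] = 2.0 * X[idx].T @ residual / batch_size
    return float(grads.std(axis=0).mean())


# Synthetic regression problem (assumed for the example).
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 4))
y = X @ np.array([1.0, -1.0, 2.0, 0.5]) + rng.normal(size=5000)
w = np.zeros(4)

for b in (1, 32, 512):
    print(b, gradient_spread(X, y, w, b))
```

The printed spread should shrink as the batch size grows, roughly in proportion to one over the square root of the batch size, since each estimate averages more samples.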

Common Applications

Mini-batch training is standard in deep learning frameworks across computer vision (image classification), natural language processing (transformer model training), and recommender systems. It is the default approach for training neural networks in most production machine learning pipelines, whether in research institutions or enterprise deployments.

Key Considerations

Batch size is a hyperparameter that must be tuned. Larger batches reduce gradient noise but may converge to sharper minima, which are often associated with poorer generalisation, whilst smaller batches add noise that acts as a regulariser but require more weight updates per epoch. Memory constraints and hardware availability often dictate the practical upper limit on batch size.
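The iteration cost of a given batch size is simple arithmetic: one pass over the data takes the dataset size divided by the batch size, rounded up, in weight updates. The helper below is a hypothetical sketch; the name `updates_per_epoch`, the `drop_last` flag (which discards a final partial batch), and the 50,000-sample dataset are assumptions for the example.

```python
import math


def updates_per_epoch(n_samples, batch_size, drop_last=False):
    """Number of weight updates in one pass over the data.

    Hypothetical helper for illustration only.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_last:
        # Discard the final partial batch, if any.
        return n_samples // batch_size
    # Keep the final partial batch as a smaller update.
    return math.ceil(n_samples / batch_size)


for b in (32, 128, 512):
    print(b, updates_per_epoch(50_000, b))
# 32 -> 1563, 128 -> 391, 512 -> 98
```

Going from a batch size of 32 to 512 cuts the number of updates per epoch by roughly a factor of sixteen, which is why larger batches are usually paired with a retuned learning rate.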

Cross-References

Machine Learning

Stochastic Gradient Descent

Gradient Descent

Related in Training Techniques

Ridge Regression

A regularised regression technique that adds an L2 penalty term to prevent overfitting by constraining coefficient magnitudes.

Elastic Net

A regularisation technique combining L1 and L2 penalties, balancing feature selection and coefficient shrinkage.

Cross-Validation

A resampling technique that partitions data into subsets, training on some and validating on others to assess model generalisation.

Overfitting

When a model learns the training data too well, including noise, resulting in poor performance on unseen data.

Underfitting

When a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and test data.

Bias-Variance Tradeoff

The balance between a model's ability to minimise bias (error from assumptions) and variance (sensitivity to training data fluctuations).

Regularisation

Techniques that add constraints or penalties to a model to prevent overfitting and improve generalisation to new data.

Gradient Descent

An optimisation algorithm that iteratively adjusts parameters in the direction of steepest descent of the loss function.

Stochastic Gradient Descent

A variant of gradient descent that updates parameters using a single randomly selected training sample each iteration; in common usage the term also covers updates computed on a small random mini-batch.

Adam Optimiser

An adaptive learning rate optimisation algorithm combining momentum and RMSProp for efficient deep learning training.

Learning Rate

A hyperparameter that controls how much model parameters are adjusted with respect to the loss gradient during training.

Loss Function

A mathematical function that measures the difference between predicted outputs and actual target values during model training.

More in Machine Learning

Model Serving

MLOps & Production

The infrastructure and processes for deploying trained machine learning models to production environments for real-time predictions.

Semi-Supervised Learning

Advanced Methods

A learning approach that combines a small amount of labelled data with a large amount of unlabelled data during training.

Multi-Task Learning

MLOps & Production

A machine learning approach where a model is simultaneously trained on multiple related tasks to improve generalisation.

Deep Reinforcement Learning

Reinforcement Learning

Combining deep neural networks with reinforcement learning to enable agents to learn complex decision-making from raw sensory input.

Decision Tree

Supervised Learning

A tree-structured model where internal nodes represent feature tests, branches represent outcomes, and leaves represent predictions.

Feature Engineering

Feature Engineering & Selection

The process of using domain knowledge to create, select, and transform input variables to improve model performance.

Naive Bayes

Supervised Learning

A probabilistic classifier based on applying Bayes' theorem with the assumption of independence between features.

Bagging

Advanced Methods

Bootstrap aggregating: an ensemble method that trains multiple models on bootstrap samples (random subsets drawn with replacement) and combines their predictions by averaging or voting.