Overview
Direct Answer
Stochastic Gradient Descent (SGD) is an optimisation algorithm that updates model parameters using a gradient estimated from a single training example or a small batch (a mini-batch), rather than from the entire dataset. Each update is cheaper and iteration is faster, at the cost of a noisier path and weaker convergence guarantees.
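The contrast with full-batch gradient descent can be written out as follows. The notation is an assumption made here for illustration and does not come from the entry: θ for the parameters, η for the step size, ℓ for the per-example loss, N for the dataset size, and B_t for the mini-batch drawn at step t.

```latex
% Full-batch gradient descent: gradient averaged over all N examples
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \frac{1}{N} \sum_{i=1}^{N} \ell(\theta_t; x_i, y_i)

% SGD: the same step, with the gradient averaged over a random mini-batch B_t
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \frac{1}{|B_t|} \sum_{i \in B_t} \ell(\theta_t; x_i, y_i)
```

When B_t is drawn uniformly at random, the mini-batch gradient is an unbiased estimate of the full-batch gradient, which is why the cheaper step still points in a useful direction on average.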
How It Works
At each iteration, SGD draws a single instance or a mini-batch at random from the training set, computes the gradient of the loss on that sample, and moves the parameters in the direction opposite to the gradient by a step size called the learning rate. The randomness in sample selection introduces noise into the parameter trajectory, which can help the optimiser escape shallow local minima and saddle points. Separately, because each step touches only a small sample, memory requirements are lower than for full-batch methods.
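The loop described above might be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `sgd_linear_fit`, the one-dimensional linear model, the squared-error loss, and all hyperparameter values are assumptions chosen to keep the example short.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w * x + b by mini-batch SGD on mean squared error.

    Hypothetical helper for illustration; defaults are arbitrary.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # random sample order for this pass
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch-mean squared error w.r.t. w and b.
            grad_w = 0.0
            grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2.0 * err * xs[i] / len(batch)
                grad_b += 2.0 * err / len(batch)
            # Step opposite to the gradient, scaled by the learning rate.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


xs = [i / 10 for i in range(20)]   # inputs from 0.0 to 1.9
ys = [2.0 * x + 1.0 for x in xs]   # noiseless targets from y = 2x + 1
w, b = sgd_linear_fit(xs, ys)
print(w, b)  # should approach w = 2, b = 1
```

Only `batch_size` examples are read per update, which is the source of the memory saving; shuffling the index list once per pass is one common way to realise the random sampling.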
Why It Matters
SGD enables training on datasets too large to fit in memory and cuts the wall-clock time per iteration significantly, making it essential for modern deep learning at scale. The gradient noise is often credited with improving generalisation on unseen data, whilst the low cost per step lets practitioners iterate on model design rapidly.
Common Applications
SGD and its variants are the foundation for training neural networks across computer vision, natural language processing, and recommendation systems. In deep learning frameworks it is paired with backpropagation: backpropagation computes the gradients, and SGD uses them to update the parameters. It also remains standard in federated learning, where data is partitioned across devices and updates are computed locally on individual samples or batches.
Key Considerations
The learning rate is critical because the gradients are noisy: a step that is too large can diverge, while a constant step leaves the parameters fluctuating around a minimum instead of settling at it. Decaying learning-rate schedules address the second problem, and adaptive variants like Adam and RMSprop adjust step sizes per parameter. Convergence guarantees are weaker than for batch gradient descent, and practitioners must balance batch size, learning rate scheduling, and epoch count empirically.
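The effect of the step size under noisy gradients can be seen on a toy problem. The sketch below is an assumption-laden illustration: the objective f(theta) = theta**2 / 2, the Gaussian gradient noise, the helper names `inverse_time_decay` and `tail_error`, and every constant are invented for this example.

```python
import random


def inverse_time_decay(step, base_lr=0.5, decay=0.01):
    """Hypothetical schedule: base_lr / (1 + decay * step)."""
    return base_lr / (1.0 + decay * step)


def tail_error(schedule, steps=5000, tail=1000, noise_std=1.0, seed=0):
    """Run noisy gradient descent on f(theta) = theta**2 / 2.

    Returns the mean |theta| over the last `tail` steps; the minimum is at 0.
    """
    rng = random.Random(seed)
    theta = 5.0
    total = 0.0
    for step in range(steps):
        # The true gradient is theta; Gaussian noise stands in for sampling noise.
        noisy_grad = theta + rng.gauss(0.0, noise_std)
        theta -= schedule(step) * noisy_grad
        if step >= steps - tail:
            total += abs(theta)
    return total / tail


constant = tail_error(lambda step: 0.5)   # fixed learning rate
decayed = tail_error(inverse_time_decay)  # shrinking learning rate
print(constant, decayed)
```

With a fixed rate the iterate keeps bouncing around the minimum at a distance set by the noise, whereas the decaying schedule lets it settle closer. Real training runs usually tune the schedule empirically rather than relying on a fixed formula like this one.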
More in Machine Learning
Gradient Boosting (Supervised Learning): An ensemble technique that builds models sequentially, with each new model correcting residual errors of the combined ensemble.
Self-Supervised Learning (Advanced Methods): A learning paradigm where models generate their own supervisory signals from unlabelled data through pretext tasks.
Supervised Learning (MLOps & Production): A machine learning paradigm where models are trained on labelled data, learning to map inputs to known outputs.
t-SNE (Unsupervised Learning): t-Distributed Stochastic Neighbour Embedding, a technique for visualising high-dimensional data in two or three dimensions.
Naive Bayes (Supervised Learning): A probabilistic classifier based on applying Bayes' theorem with the assumption of independence between features.
UMAP (Unsupervised Learning): Uniform Manifold Approximation and Projection, a dimensionality reduction technique for visualisation and general non-linear reduction.
SMOTE (Feature Engineering & Selection): Synthetic Minority Over-sampling Technique, a method for addressing class imbalance by generating synthetic examples of the minority class.
Random Forest (Supervised Learning): An ensemble learning method that constructs multiple decision trees during training and outputs the mode of their predictions.