Overview
Direct Answer
Adam (Adaptive Moment Estimation) is a first-order gradient-based optimisation algorithm that maintains per-parameter adaptive learning rates by computing exponential moving averages of both gradients and squared gradients. It combines the benefits of momentum-based methods with element-wise adaptive learning rate scaling, making it particularly effective for training deep neural networks with sparse or noisy gradients.
How It Works
The algorithm maintains two moment estimates for each parameter: the first moment (an exponential moving average of gradients, analogous to momentum) and the second moment (an exponential moving average of squared gradients, as in RMSProp). At each iteration, these moving averages are updated using two exponential decay rates, commonly written β1 and β2, then bias-corrected to account for their initialisation at zero. The parameter update is the learning rate multiplied by the bias-corrected first moment and divided by the square root of the bias-corrected second moment, with a small epsilon added to the denominator for numerical stability. The result is an effective step size that adapts separately for each parameter.
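The update described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names `adam_step` and `minimise_quadratic`, the toy objective f(x) = sum of x², and the learning rate of 0.01 in the demo are assumptions made for this example. The defaults for `lr`, `beta1`, `beta2`, and `eps` are the commonly used values.

```python
import math


def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place; t is the 1-based iteration count."""
    for i, g in enumerate(grads):
        # Exponential moving averages of the gradient and the squared gradient.
        m[i] = beta1 * m[i] + (1.0 - beta1) * g
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
        # Bias correction compensates for m and v starting at zero.
        m_hat = m[i] / (1.0 - beta1 ** t)
        v_hat = v[i] / (1.0 - beta2 ** t)
        # Per-parameter step; eps is added outside the square root.
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


def minimise_quadratic(start, steps=2000, lr=0.01):
    """Hypothetical demo: minimise f(x) = sum(x_i ** 2) with Adam."""
    params = list(start)
    m = [0.0] * len(params)  # first-moment estimates
    v = [0.0] * len(params)  # second-moment estimates
    for t in range(1, steps + 1):
        grads = [2.0 * p for p in params]  # gradient of sum(x_i ** 2)
        adam_step(params, grads, m, v, t, lr=lr)
    return params
```

Note that on the very first step the bias-corrected ratio reduces to roughly the sign of the gradient, so each parameter moves by about `lr` regardless of the gradient's magnitude.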
Why It Matters
The Adam optimiser has become a de facto standard for training deep learning models because it often converges faster than vanilla stochastic gradient descent and usually works well with its default hyperparameters. Its adaptive per-parameter learning rates reduce sensitivity to the choice of learning rate and schedule, which cuts the number of tuning runs and enables faster experimentation cycles. These are critical factors for organisations developing large-scale machine learning systems, where training time directly affects cost and deployment velocity.
Common Applications
The optimiser is widely used in computer vision tasks such as convolutional neural network training, in natural language processing models including transformer-based architectures, and in reinforcement learning agent training. It is available in all major deep learning frameworks and is standard practice in both research and production environments across the financial services, healthcare, and technology sectors.
Key Considerations
While each update is computationally cheap, the algorithm stores two moment estimates for every parameter, which adds two extra values per parameter to the training state and can be prohibitive for extremely large models. Bias correction matters most in early training iterations, when the zero-initialised moment estimates would otherwise be strongly biased towards zero, particularly the second moment when its decay rate is close to 1. The method's performance also remains sensitive to the exponential decay rates and epsilon hyperparameters in certain problem domains.
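Both considerations can be made concrete with some quick arithmetic. The sketch below is illustrative only: the helper names `adam_state_bytes` and `bias_correction_factors` are hypothetical, and the 4 bytes per value assumes the moment estimates are kept in 32-bit floating point.

```python
def adam_state_bytes(num_params, bytes_per_value=4):
    """Memory for Adam's two moment estimates (assumes one value each per parameter)."""
    return 2 * num_params * bytes_per_value


def bias_correction_factors(t, beta1=0.9, beta2=0.999):
    """Multipliers applied to the first and second moments at 1-based step t."""
    return 1.0 / (1.0 - beta1 ** t), 1.0 / (1.0 - beta2 ** t)
```

Under these assumptions, a hypothetical model with one billion parameters needs 8,000,000,000 bytes (about 8 GB) for the moment estimates alone. With the common decay rates of 0.9 and 0.999, the correction factors are roughly 10 and 1000 at the first step, and both approach 1 as training proceeds, which is why the correction mainly affects early iterations.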
More in Machine Learning
UMAP
Unsupervised Learning: Uniform Manifold Approximation and Projection, a dimensionality reduction technique for visualisation and general non-linear reduction.
Feature Store
MLOps & Production: A centralised repository for storing, managing, and serving machine learning features, ensuring consistency between training and inference environments across an organisation.
Markov Decision Process
Reinforcement Learning: A mathematical framework for modelling sequential decision-making where outcomes are partly random and partly controlled.
Mini-Batch
Training Techniques: A subset of the training data used to compute a gradient update during stochastic gradient descent.
Boosting
Supervised Learning: An ensemble technique that sequentially trains models, each focusing on correcting the errors of previous models.
Logistic Regression
Supervised Learning: A classification algorithm that models the probability of a binary outcome using a logistic function.
Machine Learning
MLOps & Production: A subset of AI that enables systems to automatically learn and improve from experience without being explicitly programmed.
Feature Selection
MLOps & Production: The process of identifying and selecting the most relevant input variables for a machine learning model.