Overview
Direct Answer
SMOTE (Synthetic Minority Over-sampling Technique) is a data preprocessing technique that addresses class imbalance by generating synthetic minority-class training examples rather than simply duplicating existing ones. Each synthetic sample is placed on the line segment connecting a minority example to one of its k nearest minority-class neighbours in feature space.
How It Works
The algorithm takes each minority class sample and locates its k nearest neighbours (typically k=5) within the same class. A synthetic sample is then generated by choosing one of those neighbours at random and interpolating between the two: x_new = x_i + λ (x_nn − x_i), where x_i is the original instance, x_nn is the chosen neighbour, and λ is drawn uniformly at random from [0, 1]. The new point therefore lies somewhere on the line connecting the pair in feature space. This process is repeated until the desired balance ratio is achieved.
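The interpolation step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `smote_sketch` and its parameters are hypothetical, distances are assumed Euclidean, all features are assumed numeric, and base samples are drawn at random instead of being visited one by one as described above.

```python
import numpy as np

def smote_sketch(X_min, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between minority samples.

    X_min holds minority-class rows only, shape (n_samples, n_features).
    Hypothetical helper for illustration; assumes Euclidean distance.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # a sample cannot have more neighbours than n - 1
    if k < 1:
        raise ValueError("need at least two minority samples")

    # Pairwise distances within the minority class.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest

    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for i in range(n_synthetic):
        base = rng.integers(n)                   # pick a minority sample
        nb = neighbours[base, rng.integers(k)]   # pick one of its k neighbours
        gap = rng.random()                       # random position in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Toy minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_sketch(X_min, n_synthetic=6, k=2)
print(new_points.shape)  # (6, 2)
```

With k=2 the two nearest neighbours of each corner are the adjacent corners, so every synthetic point in this toy example lies on an edge of the square. To reach a target balance, `n_synthetic` would be set to the number of minority samples still missing.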
Why It Matters
Class imbalance can severely degrade classifier performance on minority classes, leading to poor recall and F1-scores in critical domains such as fraud detection, disease diagnosis, and anomaly detection. Because it synthesises new examples rather than replicating existing ones, the technique enlarges the minority portion of the training set and encourages classifiers to learn broader decision regions for the minority class, reducing the overfitting to individual minority instances that duplication-based over-sampling tends to cause.
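A small worked example shows why recall and F1 matter here rather than accuracy. The labels below are hypothetical (95 majority cases, 5 minority cases), and the "classifier" is the degenerate one that always predicts the majority class; the helper `recall_and_f1` is written for this illustration only.

```python
def recall_and_f1(y_true, y_pred, positive=1):
    """Recall and F1 for the class labelled `positive` (illustrative helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall == 0:
        return recall, 0.0
    return recall, 2 * precision * recall / (precision + recall)

# Hypothetical imbalanced test set: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall, f1 = recall_and_f1(y_true, y_pred)
print(accuracy, recall, f1)  # 0.95 0.0 0.0
```

Accuracy looks strong at 0.95 even though not a single minority case is detected, which is the failure mode over-sampling is meant to counter.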
Common Applications
Applications include credit card fraud detection, diagnosis of rare diseases, network intrusion detection, and manufacturing defect identification. The telecommunications and banking sectors regularly employ the technique to improve detection of rare but costly adverse events.
Key Considerations
The method assumes minority class samples are sufficiently dense to form meaningful neighbourhoods; sparse or highly scattered minority data may produce poor-quality synthetic samples. Generated samples lie in interpolated regions that may not reflect the true underlying data distribution, and the choice of parameters (particularly k and the over-sampling ratio) significantly influences results.
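The density assumption can be made concrete with a toy case. The sketch below uses hypothetical data (two tight minority clusters, far apart) and hypothetical helpers `neighbour_midpoints` and `gap_count`. Instead of sampling at random, it takes the midpoint of every sample-to-neighbour segment and counts how many land in the empty region between the clusters.

```python
import numpy as np

def neighbour_midpoints(X_min, k):
    """Midpoint of each (sample, neighbour) segment for the k nearest neighbours."""
    X_min = np.asarray(X_min, dtype=float)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    return (X_min[:, None, :] + X_min[neighbours]) / 2.0  # shape (n, k, d)

def gap_count(X_min, k):
    """Return (midpoints with 1 < x < 4, total midpoints)."""
    mids = neighbour_midpoints(X_min, k).reshape(-1, X_min.shape[1])
    in_gap = int(np.sum((mids[:, 0] > 1.0) & (mids[:, 0] < 4.0)))
    return in_gap, len(mids)

# Hypothetical minority class: two clusters of three points each.
X_min = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

for k in (2, 5):
    in_gap, total = gap_count(X_min, k)
    print(f"k={k}: {in_gap} of {total} midpoints fall between the clusters")
# k=2: 0 of 12 midpoints fall between the clusters
# k=5: 18 of 30 midpoints fall between the clusters
```

With k=2 every neighbour is in the same cluster, so interpolation stays local. With k=5 each point is forced to reach into the other cluster, and most segments cross a region where no real minority data exists.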
More in Machine Learning
UMAP
Unsupervised Learning: Uniform Manifold Approximation and Projection, a dimensionality reduction technique for visualisation and general non-linear reduction.
Dimensionality Reduction
Unsupervised Learning: Techniques that reduce the number of input variables in a dataset while preserving essential information and structure.
Reinforcement Learning
MLOps & Production: A machine learning paradigm where agents learn optimal behaviour through trial and error, receiving rewards or penalties.
Markov Decision Process
Reinforcement Learning: A mathematical framework for modelling sequential decision-making where outcomes are partly random and partly controlled.
t-SNE
Unsupervised Learning: t-Distributed Stochastic Neighbour Embedding, a technique for visualising high-dimensional data in two or three dimensions.
Loss Function
Training Techniques: A mathematical function that measures the difference between predicted outputs and actual target values during model training.
Model Serving
MLOps & Production: The infrastructure and processes for deploying trained machine learning models to production environments for real-time predictions.
Ridge Regression
Training Techniques: A regularised regression technique that adds an L2 penalty term to prevent overfitting by constraining coefficient magnitudes.