Overview
Direct Answer
Clustering is an unsupervised learning technique that partitions datasets into groups of similar data points without requiring predefined class labels. It identifies inherent patterns and structures within data by measuring similarity or distance between observations.
How It Works
Clustering algorithms compute a distance or similarity measure (such as Euclidean distance or cosine similarity) between data points and use it to group observations. Centroid-based methods like K-means iteratively reassign points to the nearest cluster centre to minimise within-cluster variance. Density-based methods like DBSCAN grow clusters from densely packed regions and treat isolated points as noise. Hierarchical methods merge or split groups step by step, producing a dendrogram of nested partitions.
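The centroid-based approach can be illustrated with a minimal sketch of K-means (Lloyd's algorithm) in plain Python. The function name `kmeans`, its parameters, and the toy `points` data are illustrative assumptions rather than part of any particular library, and the random initialisation used here is only one of several common choices.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Sketch of Lloyd's algorithm: alternate assignment and update steps."""
    rng = random.Random(seed)
    # Assumption: initialise centroids by sampling k distinct input points.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid (Euclidean).
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # centroids stopped moving, so assignments are stable
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: two well-separated groups in two dimensions.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

Which group receives label 0 depends on the random initialisation, so only the grouping itself is meaningful, not the label values.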
Why It Matters
Organisations use clustering to discover hidden customer segments, summarise large datasets for downstream analysis, and identify anomalies without manual labelling costs. Clustering supports data-driven decision-making in scenarios where ground truth is unavailable or expensive to obtain.
Common Applications
Applications include customer segmentation in retail and marketing, genomic sequence grouping in bioinformatics, document organisation in information retrieval, and anomaly detection in cybersecurity. Clustering also supports image segmentation in computer vision and helps identify disease subtypes in medical research.
Key Considerations
Practitioners must choose a distance metric and algorithm family that suit the geometry of the data. Results can be sensitive to feature scaling and, for methods such as K-means, to initialisation. Choosing the number of clusters remains a fundamental challenge that typically calls for domain expertise together with validation metrics such as the silhouette score.
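As a rough illustration of how a validation metric can compare candidate groupings, the sketch below computes the mean silhouette score in plain Python using Euclidean distance. For each point, it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b) and scores the point as (b - a) / max(a, b). The function name and the toy data are assumptions for illustration only.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points; values near 1 suggest compact,
    well-separated clusters, values near 0 or below suggest overlap."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a point alone in its cluster scores 0
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, hence len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy data with one sensible and one arbitrary labelling.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(silhouette_score(points, [0, 0, 0, 1, 1, 1]))
print(silhouette_score(points, [0, 1, 0, 1, 0, 1]))
```

In practice the score is often computed for several candidate cluster counts and the results compared. Libraries such as scikit-learn ship their own `silhouette_score` function, which is usually preferable to a hand-written version on real data.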
More in Machine Learning
Semi-Supervised Learning
Advanced Methods: A learning approach that combines a small amount of labelled data with a large amount of unlabelled data during training.
Batch Learning
MLOps & Production: Training a machine learning model on the entire dataset at once before deployment, as opposed to incremental updates.
A/B Testing
Training Techniques: A controlled experiment comparing two variants to determine which performs better against a defined metric.
Ridge Regression
Training Techniques: A regularised regression technique that adds an L2 penalty term to prevent overfitting by constraining coefficient magnitudes.
Bias-Variance Tradeoff
Training Techniques: The balance between a model's ability to minimise bias (error from assumptions) and variance (sensitivity to training data fluctuations).
Bandit Algorithm
Advanced Methods: An online learning algorithm that balances exploration of new options with exploitation of known good options to maximise reward.
Markov Decision Process
Reinforcement Learning: A mathematical framework for modelling sequential decision-making where outcomes are partly random and partly controlled.
Lasso Regression
Feature Engineering & Selection: A regularised regression technique that adds an L1 penalty, enabling feature selection by driving some coefficients to zero.