Clustering — Technology Wiki

Overview

Direct Answer

Clustering is an unsupervised learning technique that partitions datasets into groups of similar data points without requiring predefined class labels. It identifies inherent patterns and structures within data by measuring similarity or distance between observations.

How It Works

Clustering algorithms compute a similarity or distance measure (such as Euclidean distance or cosine similarity) between data points and assign observations to groups so that members of a group are more similar to each other than to members of other groups. Common approaches include centroid-based methods like K-means, which iteratively minimise within-cluster variance; density-based methods like DBSCAN, which grow clusters outward from dense regions; and hierarchical methods, which build dendrograms of nested partitions.
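To make the centroid-based case concrete, the following is a minimal sketch of the standard K-means iteration (Lloyd's algorithm) using only the Python standard library. The function name `kmeans`, its parameters, and the random-sample initialisation are illustrative assumptions, not taken from any particular library; production implementations typically use smarter initialisation and vectorised arithmetic.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups.

    Hypothetical helper for illustration; returns (labels, centroids).
    """
    rng = random.Random(seed)
    # Initialise centroids by sampling k points from the data.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, centroids = kmeans(data, k=2)
    print(labels, centroids)
```

The two steps inside the loop are the whole algorithm: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points.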

Why It Matters

Organisations use clustering to discover hidden customer segments, summarise large datasets into a small number of representative groups for downstream analysis, and flag anomalies without manual labelling costs. It enables data-driven decision-making in scenarios where ground truth is unavailable or expensive to obtain.

Common Applications

Applications include customer segmentation in retail and marketing, genomic sequence grouping in bioinformatics, document organisation in information retrieval, and anomaly detection in cybersecurity. Clustering also supports image segmentation in computer vision and helps identify disease subtypes in medical research.

Key Considerations

Practitioners must choose distance metrics and algorithm families that suit the geometry of the data. Results are sensitive to feature scaling, and centroid-based methods such as K-means are also sensitive to initialisation. Determining the number of clusters remains a fundamental challenge that requires domain expertise and validation metrics such as silhouette scores.
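As an illustration of the validation metric mentioned above, here is a minimal standard-library sketch of the mean silhouette coefficient. For each point it compares a, the mean distance to the other members of its own cluster, with b, the smallest mean distance to the members of any other cluster, and scores the point as (b - a) / max(a, b). The function name and the choice of Euclidean distance are assumptions made for this example.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (ranges from -1 to 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        # (the distance from p to itself is 0, so divide by len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        if denom > 0:  # guard against all-identical points
            total += (b - a) / denom
    return total / len(points)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(silhouette_score(data, [0, 0, 0, 1, 1, 1]))  # well-separated grouping
    print(silhouette_score(data, [0, 1, 0, 1, 0, 1]))  # arbitrary grouping
```

One common way to use the score is to cluster the same data for several candidate cluster counts and prefer the count with the highest mean silhouette, subject to a check against domain knowledge.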

Cross-References

Machine Learning

Unsupervised Learning

Referenced By

4 terms mention Clustering

Other entries in the wiki whose definition references Clustering — useful for understanding how this concept connects across Machine Learning and adjacent domains.

Blockchain Forensics · Blockchain & DLT

DBSCAN · Machine Learning

Hierarchical Clustering · Machine Learning

Text Embedding Model · Natural Language Processing

Related in Unsupervised Learning

Dimensionality Reduction

Techniques that reduce the number of input variables in a dataset while preserving essential information and structure.

Principal Component Analysis

A dimensionality reduction technique that transforms data into orthogonal components ordered by the amount of variance they explain.

t-SNE

t-Distributed Stochastic Neighbour Embedding — a technique for visualising high-dimensional data in two or three dimensions.

UMAP

Uniform Manifold Approximation and Projection — a dimensionality reduction technique for visualisation and general non-linear reduction.

K-Means Clustering

A partitioning algorithm that divides data into k clusters by minimising the sum of squared distances between points and their cluster centroids.
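In its usual formulation, the quantity K-means minimises is the within-cluster sum of squared Euclidean distances. The notation below is an assumption introduced for this sketch: C_j denotes the set of points assigned to cluster j and \mu_j its centroid.

```latex
J(C_1, \dots, C_k) = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
```

Because each centroid is the mean of its cluster, recomputing centroids can never increase J for a fixed assignment, and reassigning each point to its nearest centroid can never increase J for fixed centroids.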

DBSCAN

Density-Based Spatial Clustering of Applications with Noise — a clustering algorithm that finds arbitrarily shaped clusters based on density.
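The density-based idea can be sketched in a few lines of standard-library Python. This is a simplified illustration of the textbook DBSCAN procedure with a brute-force neighbour search. The parameter names `eps` (neighbourhood radius) and `min_pts` (minimum neighbours for a core point, counting the point itself) are conventional but are assumptions here, as is the use of Euclidean distance.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        # Point i is a core point: start a new cluster and expand it.
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
                continue
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbours)
    return labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(dbscan(data, eps=1.5, min_pts=3))
```

Unlike K-means, the number of clusters is not supplied in advance: it emerges from `eps` and `min_pts`, and isolated points are labelled as noise rather than forced into a cluster.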

Hierarchical Clustering

A clustering method that builds a tree-like hierarchy of clusters through successive merging or splitting of groups.
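The "successive merging" variant (agglomerative clustering) can be sketched as follows. This example assumes single linkage, where the distance between two clusters is the distance between their closest pair of points; other linkage rules (complete, average, Ward) differ only in that function. The function name and return format are hypothetical, and the brute-force search is for illustration only.

```python
import math


def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering.

    Returns (clusters, merges): clusters are lists of point indices and
    merges records each (left, right, distance) step, i.e. the dendrogram.
    """
    clusters = [[i] for i in range(len(points))]  # start with singletons
    merges = []

    def linkage(a, b):
        # Single linkage: distance between the closest pair across two clusters.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters (x < y).
        x, y = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[x], clusters[y], linkage(clusters[x], clusters[y])))
        merged = clusters[x] + clusters[y]
        # y > x, so delete y first to keep index x valid.
        del clusters[y]
        del clusters[x]
        clusters.append(merged)
    return clusters, merges


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    clusters, merges = agglomerative(data, n_clusters=2)
    print(clusters)
```

Running the loop down to a single cluster and reading `merges` in order gives the full tree; stopping at `n_clusters` corresponds to cutting the dendrogram at that level.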

Association Rule Learning

A method for discovering interesting relationships and patterns between variables in large datasets.

Collaborative Filtering

A recommendation technique that makes predictions based on the collective preferences and behaviour of many users.

Content-Based Filtering

A recommendation approach that suggests items similar to those a user has previously liked, based on item attributes.

Matrix Factorisation

A technique that decomposes a matrix into constituent matrices, widely used in recommendation systems and dimensionality reduction.

More in Machine Learning

Semi-Supervised Learning · Advanced Methods

A learning approach that combines a small amount of labelled data with a large amount of unlabelled data during training.

Batch Learning · MLOps & Production

Training a machine learning model on the entire dataset at once before deployment, as opposed to incremental updates.

A/B Testing · Training Techniques

A controlled experiment comparing two variants to determine which performs better against a defined metric.

Ridge Regression · Training Techniques

A regularised regression technique that adds an L2 penalty term to prevent overfitting by constraining coefficient magnitudes.

Bias-Variance Tradeoff · Training Techniques

The balance between a model's ability to minimise bias (error from assumptions) and variance (sensitivity to training data fluctuations).

Bandit Algorithm · Advanced Methods

An online learning algorithm that balances exploration of new options with exploitation of known good options to maximise reward.

Markov Decision Process · Reinforcement Learning

A mathematical framework for modelling sequential decision-making where outcomes are partly random and partly controlled.

Lasso Regression · Feature Engineering & Selection

A regularised regression technique that adds an L1 penalty, enabling feature selection by driving some coefficients to zero.