Unsupervised learning is a machine learning paradigm in which algorithms discover hidden patterns and structure in unlabeled data, without predefined outputs or target variables. Unlike supervised methods, these algorithms receive no labels to learn from; they identify similarities, groupings, and anomalies across tasks such as clustering, dimensionality reduction, topic modeling, anomaly detection, and self-supervised representation learning. The field has expanded considerably with modern deep learning: contrastive and masked self-supervised methods now learn transferable representations that rival supervised pretraining on many benchmarks. The key insight: unsupervised algorithms must balance discovering meaningful structure against overfitting to noise, which makes evaluation metrics and domain knowledge essential for interpreting results.
What This Cheat Sheet Covers
This topic spans 11 focused tables and 80 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.
Table 1: Core Clustering Algorithms
| Algorithm | Example | Description |
|---|---|---|
| K-Means | `from sklearn.cluster import KMeans`<br>`kmeans = KMeans(n_clusters=3, init='k-means++')`<br>`labels = kmeans.fit_predict(X)` | • Partitions data into k spherical clusters by minimizing within-cluster variance<br>• Fast, but requires a predefined k and assumes equal-sized, convex clusters |
| DBSCAN | `from sklearn.cluster import DBSCAN`<br>`dbscan = DBSCAN(eps=0.5, min_samples=5)`<br>`labels = dbscan.fit_predict(X)` | • Density-based clustering that finds arbitrarily shaped clusters<br>• Marks low-density points as noise (label `-1`)<br>• No need to specify the cluster count, but sensitive to `eps` and `min_samples` |
| Hierarchical Clustering | `from sklearn.cluster import AgglomerativeClustering`<br>`hc = AgglomerativeClustering(n_clusters=3)`<br>`labels = hc.fit_predict(X)` | • Builds a tree of nested clusters (dendrogram) via an agglomerative (bottom-up) or divisive (top-down) approach<br>• Interpretable, but needs O(n²) memory |
| Gaussian Mixture Model (GMM) | `from sklearn.mixture import GaussianMixture`<br>`gmm = GaussianMixture(n_components=3)`<br>`labels = gmm.fit_predict(X)` | • Probabilistic soft clustering that assumes the data comes from a mixture of Gaussians<br>• Yields membership probabilities rather than hard assignments |
| HDBSCAN | `from sklearn.cluster import HDBSCAN`<br>`clusterer = HDBSCAN(min_cluster_size=5)`<br>`labels = clusterer.fit_predict(X)` | • Hierarchical density-based clustering with robust noise detection and support for varying-density clusters<br>• Native in scikit-learn ≥ 1.3<br>• Minimal tuning required |
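
The trade-offs in Table 1 are easier to see on data that violates K-Means' convexity assumption. The following is a minimal sketch, not a benchmark: the two-moons dataset, `eps=0.2`, and the fixed random seeds are illustrative assumptions rather than recommended defaults. The adjusted Rand index compares labels against ground truth, which is available here only because the data is synthetic.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Two interleaving half-moons: non-convex clusters (illustrative data choice).
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-Means: k must be given up front; it favors convex, similarly sized groups.
kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels_kmeans = kmeans.fit_predict(X)

# DBSCAN: no cluster count needed; eps=0.2 is an assumed value for this data's scale.
labels_dbscan = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# GMM: soft clustering; each row of `proba` holds per-component membership probabilities.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
proba = gmm.predict_proba(X)

# Agreement with the true moon labels (1.0 means a perfect match).
ari_kmeans = adjusted_rand_score(y_true, labels_kmeans)
ari_dbscan = adjusted_rand_score(y_true, labels_dbscan)

# DBSCAN labels noise points as -1.
n_noise = int(np.sum(labels_dbscan == -1))

print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f} ({n_noise} noise points)")
print("GMM membership probabilities, first 3 rows:")
print(proba[:3].round(3))
```

On real, unlabeled data there is no `y_true`, so an internal metric such as `sklearn.metrics.silhouette_score` would take the place of the adjusted Rand index.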