Scikit-Learn Cheat Sheet

Updated 2026-04-27


Scikit-learn (sklearn) is one of Python's most widely used machine learning libraries, built on NumPy, SciPy, and Matplotlib to provide simple, efficient tools for predictive data analysis. It offers a consistent estimator API across a large collection of algorithms, from linear regression to Gaussian processes, along with preprocessing, model selection, and evaluation utilities. Its ease of use and well-tested implementations make it a common choice for both rapid prototyping and production models. It covers supervised learning (classification, regression), unsupervised learning (clustering, dimensionality reduction), semi-supervised learning, and the workflow from data preparation through model persistence.

What This Cheat Sheet Covers

This topic spans 23 focused tables and 162 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Supervised Learning — Classification Algorithms
Table 2: Supervised Learning — Regression Algorithms
Table 3: Unsupervised Learning — Clustering Algorithms
Table 4: Semi-Supervised Learning
Table 5: Unsupervised Learning — Dimensionality Reduction
Table 6: Clustering Evaluation Metrics
Table 7: Data Preprocessing — Scaling and Normalization
Table 8: Data Preprocessing — Encoding Categorical Variables
Table 9: Data Preprocessing — Handling Missing Values
Table 10: Feature Engineering and Selection
Table 11: Model Selection — Train/Test Splitting
Table 12: Model Selection — Cross-Validation
Table 13: Hyperparameter Tuning
Table 14: Pipelines and Composition
Table 15: Classification Metrics
Table 16: Regression Metrics
Table 17: Text Feature Extraction
Table 18: Ensemble Methods
Table 19: Probability Calibration
Table 20: Multiclass and Multilabel Strategies
Table 21: Model Inspection
Table 22: Advanced Techniques
Table 23: Model Persistence and Deployment


Table 1: Supervised Learning — Classification Algorithms

Classification is the most common starting point in scikit-learn: every estimator here predicts a discrete label and shares the same fit/predict interface. The list spans the spectrum you'll actually reach for: fast linear baselines like Logistic Regression and Perceptron, the tree ensembles (Random Forest, Gradient Boosting, AdaBoost) that perform strongly on many tabular problems, kernel SVMs for non-linear decision boundaries, and the Naive Bayes family for text. A useful rule of thumb is to start with a simple, interpretable baseline and move to a more complex model only when the accuracy gain justifies the loss of interpretability.
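The shared fit/predict interface described above can be sketched as follows. This is a minimal illustration, not a benchmark: the iris dataset, the 75/25 split, the `random_state` values, and `max_iter=1000` are assumptions chosen only to make the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Small built-in dataset, used here purely for illustration.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Three different model families behind the same interface.
classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gaussian_nb": GaussianNB(),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                   # learn from the training split
    y_pred = clf.predict(X_test)                # discrete class labels
    scores[name] = clf.score(X_test, y_test)    # mean accuracy on held-out data
    print(f"{name}: accuracy={scores[name]:.3f}")
```

Because the loop body never changes, swapping in another classifier from the table only requires adding one entry to the dictionary.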

Algorithm	Example	Description
Logistic Regression	`from sklearn.linear_model import LogisticRegression` `clf = LogisticRegression()` `clf.fit(X_train, y_train)`	• Binary or multiclass linear classifier that models class probabilities with the logistic function • supports L1, L2, or ElasticNet regularization to limit overfitting (L2 by default; the available penalties depend on the solver).
Random Forest Classifier	`from sklearn.ensemble import RandomForestClassifier` `clf = RandomForestClassifier(n_estimators=100)` `clf.fit(X_train, y_train)`	• Ensemble of decision trees trained on bootstrap samples with random feature subsets • averages the trees' predicted class probabilities to reduce variance and exposes feature importance scores via `feature_importances_`.
Gradient Boosting Classifier	`from sklearn.ensemble import GradientBoostingClassifier` `clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)` `clf.fit(X_train, y_train)`	• Sequentially builds trees where each corrects errors of previous ones • learning rate controls contribution of each tree; powerful but sensitive to overfitting.
Support Vector Classifier (SVC)	`from sklearn.svm import SVC` `clf = SVC(kernel='rbf', C=1.0)` `clf.fit(X_train, y_train)`	• Finds the maximum-margin hyperplane separating classes • uses the kernel trick (linear, RBF, polynomial, sigmoid) for non-linear boundaries • C controls the trade-off between a wide margin and misclassified training points: larger C penalizes errors more heavily.
Decision Tree Classifier	`from sklearn.tree import DecisionTreeClassifier` `clf = DecisionTreeClassifier(max_depth=5)` `clf.fit(X_train, y_train)`	• Recursively splits data based on feature thresholds to minimize impurity (Gini or entropy) • interpretable but prone to overfitting without depth limits.
K-Nearest Neighbors Classifier	`from sklearn.neighbors import KNeighborsClassifier` `clf = KNeighborsClassifier(n_neighbors=5)` `clf.fit(X_train, y_train)`	• Non-parametric lazy learner that assigns the class by majority vote of the k nearest neighbors • distance-based, so features should be scaled to comparable ranges before fitting.
Naive Bayes — Gaussian	`from sklearn.naive_bayes import GaussianNB` `clf = GaussianNB()` `clf.fit(X_train, y_train)`	• Assumes features follow Gaussian distribution; applies Bayes' theorem with naive independence assumption • fast and effective for continuous features.
Naive Bayes — Multinomial	`from sklearn.naive_bayes import MultinomialNB` `clf = MultinomialNB(alpha=1.0)` `clf.fit(X_train, y_train)`	• Designed for discrete count data (e.g., word counts) • alpha sets additive smoothing (alpha=1.0 is Laplace smoothing; smaller values give Lidstone smoothing) • commonly used for document classification.
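The scaling note on the K-Nearest Neighbors row can be illustrated with a short sketch. The wine dataset, the 70/30 split, and the use of `StandardScaler` inside `make_pipeline` are assumptions made for this example; other scalers or datasets may behave differently.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The wine features have very different ranges, so raw distances are
# dominated by the largest-valued columns.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# KNN on raw features.
knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# KNN on standardized features; the pipeline fits the scaler on the
# training split only and reuses those statistics at prediction time.
knn_scaled = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)
).fit(X_train, y_train)

raw_acc = knn_raw.score(X_test, y_test)
scaled_acc = knn_scaled.score(X_test, y_test)
print(f"raw features:    accuracy={raw_acc:.3f}")
print(f"scaled features: accuracy={scaled_acc:.3f}")
```

Wrapping the scaler and the classifier in one pipeline keeps test-set statistics out of the scaling step, which is the main reason to prefer it over scaling the full dataset up front.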
