Scikit-learn (sklearn) is Python's most widely adopted machine learning library, built on NumPy, SciPy, and Matplotlib to provide simple, efficient tools for predictive data analysis. It offers a unified, consistent API across hundreds of algorithms, from linear regression to Gaussian processes, along with essential preprocessing, model selection, and evaluation utilities. Its ease of use and production-ready implementations make it a go-to library for both rapid prototyping and deploying ML models. It covers supervised learning (classification, regression), unsupervised learning (clustering, dimensionality reduction), semi-supervised learning, and the full workflow from data preparation to model deployment.
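
As a rough illustration of that unified API, the sketch below chains a preprocessing step and a classifier into one estimator and evaluates it with cross-validation. The iris dataset, the 5-fold setting, and `max_iter=1000` are choices made for this example only, not recommendations from the cheat sheet.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset, used here purely for illustration
X, y = load_iris(return_X_y=True)

# Preprocessing and model combined into a single estimator
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Model selection utility: 5-fold cross-validated accuracy
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")

# The pipeline exposes the same fit/predict interface as any single estimator
model.fit(X, y)
predictions = model.predict(X[:5])
```

Because the scaler sits inside the pipeline, it is refit on each training fold during cross-validation, which avoids leaking statistics from the held-out fold.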
What This Cheat Sheet Covers
This topic spans 23 focused tables and 162 indexed concepts. A complete table-by-table outline follows, from foundational concepts through advanced details.
Table 1: Supervised Learning - Classification Algorithms
| Algorithm | Example | Description |
|---|---|---|
| Logistic Regression | `from sklearn.linear_model import LogisticRegression`<br>`clf = LogisticRegression()`<br>`clf.fit(X_train, y_train)` | • Binary or multiclass linear classifier that models class probabilities with the logistic function<br>• Supports L1, L2, or ElasticNet regularization to prevent overfitting |
| Random Forest | `from sklearn.ensemble import RandomForestClassifier`<br>`clf = RandomForestClassifier(n_estimators=100)`<br>`clf.fit(X_train, y_train)` | • Ensemble of decision trees trained on bootstrap samples with random feature subsets<br>• Averages predictions to reduce variance and provides feature importance scores |
| Gradient Boosting | `from sklearn.ensemble import GradientBoostingClassifier`<br>`clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)`<br>`clf.fit(X_train, y_train)` | • Builds trees sequentially, each one correcting the errors of the previous ones<br>• Learning rate controls the contribution of each tree; powerful but prone to overfitting |
| Support Vector Classifier (SVC) | `from sklearn.svm import SVC`<br>`clf = SVC(kernel='rbf', C=1.0)`<br>`clf.fit(X_train, y_train)` | • Finds the maximum-margin hyperplane separating the classes<br>• Uses the kernel trick (linear, RBF, polynomial, sigmoid) for non-linear boundaries<br>• `C` controls the trade-off between margin width and misclassification |
| Decision Tree | `from sklearn.tree import DecisionTreeClassifier`<br>`clf = DecisionTreeClassifier(max_depth=5)`<br>`clf.fit(X_train, y_train)` | • Recursively splits data on feature thresholds to minimize impurity (Gini or entropy)<br>• Interpretable but prone to overfitting without depth limits |
| K-Nearest Neighbors | `from sklearn.neighbors import KNeighborsClassifier`<br>`clf = KNeighborsClassifier(n_neighbors=5)`<br>`clf.fit(X_train, y_train)` | • Non-parametric lazy learner that assigns the class by majority vote of the k nearest neighbors<br>• Distance-based, so features should be scaled for good results |
| Gaussian Naive Bayes | `from sklearn.naive_bayes import GaussianNB`<br>`clf = GaussianNB()`<br>`clf.fit(X_train, y_train)` | • Assumes each feature follows a Gaussian distribution within each class; applies Bayes' theorem with the naive independence assumption<br>• Fast and effective for continuous features |
| Multinomial Naive Bayes | `from sklearn.naive_bayes import MultinomialNB`<br>`clf = MultinomialNB(alpha=1.0)`<br>`clf.fit(X_train, y_train)` | • Designed for discrete count data (e.g., word counts)<br>• `alpha` adds Laplace smoothing; commonly used for document classification |
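
The snippets in Table 1 assume `X_train` and `y_train` already exist. A minimal sketch of one way to create them and compare the classifiers side by side is shown below. The synthetic dataset, the split ratio, the `random_state` values, and the `accuracies` dictionary are assumptions made for this example. `MultinomialNB` is left out because it expects non-negative count features, which this synthetic data does not provide. The distance- and margin-based models are wrapped in a scaling pipeline, following the note on feature scaling in the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, random_state=0
    ),
    # SVC and KNN depend on distances, so scale features first
    "SVC": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    "DecisionTree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "KNeighbors": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "GaussianNB": GaussianNB(),
}

# Every estimator shares the same fit/score interface
accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    accuracies[name] = clf.score(X_test, y_test)
    print(f"{name:20s} test accuracy: {accuracies[name]:.3f}")
```

Accuracy on a single synthetic split says little about which algorithm suits a real problem; the loop is only meant to show that the table's estimators are interchangeable behind the same interface.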