Scikit-learn Pipelines and Preprocessing Cheat Sheet

Updated 2026-05-15

Scikit-learn pipelines chain preprocessing transformers and a final estimator into a single composable object. The relevant classes live in sklearn.pipeline and sklearn.compose. Pipelines make data transformations reproducible, prevent data leakage during cross-validation, and streamline hyperparameter tuning. Every parameter a step learns from the training data (scaling means, encoding categories) is applied unchanged to validation and test folds. Key mental model: think of a pipeline as an assembly line where each station (transformer) modifies the data in a consistent, repeatable way. Transformers are fit on training data only; validation and test data pass through transform alone.
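The leakage point above can be checked directly. The following is a minimal sketch on synthetic data (the dataset, split sizes, and variable names are illustrative assumptions, not part of the cheat sheet): it fits a pipeline on the training split and confirms that the scaler's learned mean comes from the training rows alone.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely illustrative.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)

# The scaler's statistics were learned from the training rows only.
scaler = pipe.named_steps["scaler"]
train_mean_matches = np.allclose(scaler.mean_, X_train.mean(axis=0))

# score() only calls transform on the scaler; nothing is refit on X_test.
test_accuracy = pipe.score(X_test, y_test)

# In cross-validation the whole pipeline is cloned and refit per fold,
# so each fold's scaler sees only that fold's training portion.
cv_scores = cross_val_score(pipe, X, y, cv=5)
```

Passing the whole pipeline to `cross_val_score`, rather than scaling `X` up front, is what keeps each validation fold out of the scaler's fit.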

What This Cheat Sheet Covers

This topic spans 13 focused tables and 58 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core Pipeline Classes
Table 2: Numerical Scaling Transformers
Table 3: Categorical Encoding Transformers
Table 4: Missing Data Imputation
Table 5: Feature Selection within Pipelines
Table 6: Advanced Feature Engineering
Table 7: Custom Transformers
Table 8: Hyperparameter Tuning with Pipelines
Table 9: Cross-Validation with Pipelines
Table 10: Pipeline Introspection and Utilities
Table 11: ColumnTransformer Advanced Features
Table 12: Memory Caching with Joblib
Table 13: Common Pipeline Patterns

Table 1: Core Pipeline Classes

Class	Example	Description
Pipeline	`Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])`	Chains transformers sequentially with an optional final estimator; during `fit`, calls `fit_transform` on every step except the last, which is only fit
make_pipeline	`make_pipeline(StandardScaler(), LogisticRegression())`	Convenience constructor that auto-generates step names (`'standardscaler'`, `'logisticregression'`) instead of requiring tuples
ColumnTransformer	`ColumnTransformer([('num', StandardScaler(), ['age', 'income']), ('cat', OneHotEncoder(), ['city'])])`	Applies different transformers to different column subsets; concatenates the results horizontally into a single feature matrix
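The three classes in Table 1 are typically combined. The sketch below assumes a small hypothetical DataFrame whose column names mirror the table's example (the values, labels, and the `handle_unknown="ignore"` setting are illustrative choices, not from the cheat sheet); it nests a ColumnTransformer inside a Pipeline and shows the step names that make_pipeline generates.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; only the column names come from the table.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [40000, 52000, 88000, 91000, 61000, 45000],
    "city": ["Paris", "Lyon", "Paris", "Nice", "Lyon", "Nice"],
})
y = [0, 0, 1, 1, 1, 0]

# Scale the numeric columns, one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Explicit step names make later parameter access readable.
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
model.fit(df, y)

# 2 scaled numeric columns + 3 one-hot city columns.
n_features = model.named_steps["preprocess"].transform(df).shape[1]

# make_pipeline derives step names from the lowercased class names.
auto_named = make_pipeline(StandardScaler(), LogisticRegression())
step_names = [name for name, _ in auto_named.steps]
```

Explicit names via `Pipeline` are usually worth the extra typing when the steps will be referenced later, for example in a parameter grid; `make_pipeline` suits quick experiments.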
