Scikit-learn pipelines are workflow tools that chain preprocessing transformers and estimators into a single composable object. Located in sklearn.pipeline and sklearn.compose, they make data transformations reproducible, prevent data leakage when the whole pipeline is passed to a cross-validation routine, and streamline hyperparameter tuning. A pipeline guarantees that whatever a step learns from the training data (scaling means, encoder categories) is applied unchanged to validation and test folds. Key mental model: think of a pipeline as an assembly line where each station (transformer) modifies the data in a consistent, repeatable way. Transformers never see test data during fit, only during transform.
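The leakage point can be illustrated with a minimal sketch. The synthetic dataset, the 150/50 split, and the variable names below are assumptions chosen for illustration, not part of the cheat sheet; the sketch only shows that the scaler's statistics come from the rows passed to fit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used purely for illustration (an assumption of this sketch).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# cross_val_score clones and refits the whole pipeline on each training fold,
# so the scaler's mean and variance are never computed from held-out rows.
scores = cross_val_score(pipe, X, y, cv=5)

# Fitting on a training slice stores statistics from those rows only.
pipe.fit(X[:150], y[:150])
train_mean = pipe.named_steps["scaler"].mean_

# predict() reuses the stored statistics; the held-out rows are only transformed.
preds = pipe.predict(X[150:])
```

Scaling X once up front and then cross-validating only the classifier would instead let every fold's scaler see the held-out rows, which is the leakage the pipeline avoids.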
What This Cheat Sheet Covers
This topic spans 13 focused tables and 58 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Pipeline Classes
These are the building blocks you compose everything else from. Pipeline chains steps in sequence, ColumnTransformer routes different columns to different transformers, and FeatureUnion runs transformers in parallel and stitches their outputs together — with make_* shortcuts that auto-name steps so you can skip the boilerplate tuples. Master these five and the rest of the library snaps into place around them.
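As a rough sketch of how these pieces compose, the example below routes columns with ColumnTransformer, wraps the result with make_pipeline, and runs a small FeatureUnion. The toy DataFrame, its values, the labels, and the choice of PCA as the second union branch are hypothetical and exist only for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names follow the ColumnTransformer example.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [40_000, 52_000, 88_000, 95_000, 61_000, 45_000],
    "city": ["Oslo", "Lima", "Oslo", "Pune", "Lima", "Pune"],
})
y = [0, 0, 1, 1, 1, 0]

# Route numeric and categorical columns to different transformers.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# make_pipeline names each step after its lowercased class name.
model = make_pipeline(preprocess, LogisticRegression())
step_names = list(model.named_steps)

model.fit(df, y)
# 2 scaled numeric columns + 3 one-hot city columns
n_features = model.named_steps["columntransformer"].transform(df).shape[1]

# FeatureUnion runs both transformers on the same input and stacks the outputs:
# 2 scaled columns + 1 principal component.
union = FeatureUnion([("scaled", StandardScaler()), ("pca", PCA(n_components=1))])
union_width = union.fit_transform(df[["age", "income"]]).shape[1]
```

The auto-generated names matter later because they become the prefixes for nested parameters, for example logisticregression__C in a parameter grid.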
| Class | Example | Description |
|---|---|---|
| Pipeline | Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())]) | • Chains transformers sequentially with an optional final estimator • During fit, calls fit_transform on each step except the last, then fit on the final step |
| make_pipeline | make_pipeline(StandardScaler(), LogisticRegression()) | Convenience constructor that auto-generates step names ('standardscaler', 'logisticregression') from the lowercased class names, so no (name, estimator) tuples are needed |
| ColumnTransformer | ColumnTransformer([('num', StandardScaler(), ['age', 'income']), ('cat', OneHotEncoder(), ['city'])]) | • Applies different transformers to different column subsets • Concatenates the results horizontally into a single feature matrix |