Scikit-learn pipelines are workflow tools that chain preprocessing transformers and estimators into a single composable object. Located in sklearn.pipeline and sklearn.compose, they ensure reproducible data transformations, prevent data leakage during cross-validation, and streamline hyperparameter tuning. Pipelines enforce that each transformation learned from training data (e.g., scaling means, encoding categories) is applied identically to validation and test folds. Key mental model: think of pipelines as assembly lines where each station (transformer) modifies the data in a consistent, repeatable way: transformers never see test data during fit, only during transform.
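A minimal sketch of the leakage-safe workflow described above. Because the scaler lives inside the Pipeline, cross_val_score re-fits it on each training fold only; the held-out fold is scaled with statistics it never contributed to. The dataset here is synthetic, purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),   # fit (mean/std) on training folds only
    ("clf", LogisticRegression()),  # final estimator
])

# Each CV split re-fits the whole pipeline on the training fold,
# so no test-fold statistics leak into the scaler.
scores = cross_val_score(pipe, X, y, cv=5)
print(round(scores.mean(), 3))
```

Had StandardScaler been fit on all of X before splitting, test-fold means and variances would have leaked into training; the pipeline makes that mistake structurally impossible.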
What This Cheat Sheet Covers
This topic spans 13 focused tables and 58 indexed concepts. Below is a complete table-by-table outline, spanning foundational concepts through advanced details.
Table 1: Core Pipeline Classes
| Class | Example | Description |
|---|---|---|
| Pipeline | Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())]) | Chains transformers sequentially with an optional final estimator; calls fit_transform on each step except the last |
| make_pipeline | make_pipeline(StandardScaler(), LogisticRegression()) | Convenience constructor that auto-generates step names ('standardscaler', 'logisticregression') instead of requiring (name, step) tuples |
| ColumnTransformer | ColumnTransformer([('num', StandardScaler(), ['age', 'income']), ('cat', OneHotEncoder(), ['city'])]) | Applies different transformers to different column subsets; concatenates results horizontally into a single feature matrix |
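A short sketch combining all three classes from Table 1: ColumnTransformer routes numeric and categorical columns to different transformers, and make_pipeline chains the result with a classifier under auto-generated step names. The DataFrame and its column names are illustrative assumptions, not from the source.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type data matching the column names in Table 1
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40_000, 52_000, 81_000, 66_000],
    "city": ["NY", "SF", "NY", "LA"],
})
y = [0, 1, 1, 0]

# Route numeric columns to scaling, categorical columns to one-hot encoding
pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(), ["city"]),
])

# make_pipeline auto-names the steps from the lowercased class names
model = make_pipeline(pre, LogisticRegression())
model.fit(df, y)
print(list(model.named_steps))  # ['columntransformer', 'logisticregression']
```

The transformed feature matrix concatenates the two scaled numeric columns with the one-hot city columns horizontally, exactly as the table describes.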