Feature engineering is the process of transforming raw data into meaningful features that improve machine learning model performance. It sits at the intersection of domain knowledge and data science, converting raw observations into numeric representations that algorithms can interpret. Automated approaches exist, but manual feature engineering remains critical: the choice of transformations, encoding strategies, and scaling methods often determines whether a model achieves mediocre or exceptional results. A key principle is to fit every transformation on the training data only and then apply it to the test data, which prevents data leakage. Understanding these techniques helps you extract as much signal as possible from your data.
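The fit-on-train, apply-to-test principle can be sketched with a simple standardizer. This is a minimal illustration using only the Python standard library; the helper names `fit_standardizer` and `apply_standardizer` and the sample values are hypothetical, not part of any library.

```python
from statistics import mean, pstdev

def fit_standardizer(train_values):
    """Learn the mean and standard deviation from the training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # fall back to 1.0 for constant features
    return mu, sigma

def apply_standardizer(values, params):
    """Scale values with previously fitted statistics (no refitting)."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

# Hypothetical toy data
train = [10.0, 12.0, 14.0, 16.0]
test = [11.0, 20.0]

params = fit_standardizer(train)                  # statistics come from train only
train_scaled = apply_standardizer(train, params)
test_scaled = apply_standardizer(test, params)    # test reuses the train statistics
```

Fitting the statistics on the combined train and test data instead would let information about the test distribution leak into the features the model is trained on.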
What This Cheat Sheet Covers
This topic spans 19 focused tables and 123 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Categorical Encoding Methods
| Method | Example | Description |
|---|---|---|
| One-hot encoding | ['red','blue'] → [[1,0],[0,1]] | • Creates a binary column for each category • Most common encoding for nominal variables with low cardinality |
| Label encoding | ['low','med','high'] → [0, 1, 2] | • Maps each category to an integer • Suitable for tree-based models, or for ordinal variables when the integer order matches the category order |
| Target encoding | cat → mean(target \| cat) | • Replaces each category with the target mean for that group • Powerful for high cardinality but risks overfitting without cross-fitting |
| Ordinal encoding | ['small','medium','large'] → [1, 2, 3] | • Assigns ordered integers based on inherent ranking • Preserves ordinal relationships |
| Frequency encoding | cat → count(cat)/total | • Encodes each category by its occurrence frequency • Useful when frequency correlates with the target |
| Binary encoding | [0,1,2,3] → [[0,0],[0,1],[1,0],[1,1]] | • Converts integer codes to binary digits, one column per bit • Reduces dimensionality vs one-hot for high cardinality |
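
The encodings in the table can be sketched in plain Python. This is an illustrative, dependency-free version rather than a library implementation: all function names are hypothetical, and the out-of-fold target encoder uses a simple index-modulo fold split and a global-mean fallback as assumptions.

```python
from collections import Counter

def one_hot(values):
    """One binary column per category, columns in first-appearance order."""
    cats = list(dict.fromkeys(values))
    return [[1 if v == c else 0 for c in cats] for v in values]

def ordinal(values, order):
    """Map categories to integers following an explicit ranking (starting at 1)."""
    rank = {c: i for i, c in enumerate(order, start=1)}
    return [rank[v] for v in values]

def frequency(values):
    """Replace each category with its share of all rows."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in values]

def binary(codes):
    """Write each non-negative integer code as fixed-width binary digits."""
    width = max(1, max(codes).bit_length())
    return [[(c >> b) & 1 for b in reversed(range(width))] for c in codes]

def target_encode_oof(cats, target, n_folds=2):
    """Out-of-fold target mean: each row is encoded from the other folds only."""
    global_mean = sum(target) / len(target)
    encoded = [0.0] * len(cats)
    for fold in range(n_folds):
        sums, counts = Counter(), Counter()
        # Accumulate statistics from rows outside the current fold
        for i, (c, y) in enumerate(zip(cats, target)):
            if i % n_folds != fold:
                sums[c] += y
                counts[c] += 1
        # Encode rows inside the current fold; unseen categories get the global mean
        for i, c in enumerate(cats):
            if i % n_folds == fold:
                encoded[i] = sums[c] / counts[c] if counts[c] else global_mean
    return encoded

print(one_hot(['red', 'blue']))          # [[1, 0], [0, 1]]
print(ordinal(['small', 'medium', 'large'],
              order=['small', 'medium', 'large']))  # [1, 2, 3]
print(frequency(['a', 'a', 'b', 'c']))   # [0.5, 0.5, 0.25, 0.25]
print(binary([0, 1, 2, 3]))              # [[0, 0], [0, 1], [1, 0], [1, 1]]
```

Because `target_encode_oof` never uses a row's own target value to compute that row's encoding, it illustrates the cross-fitting idea that limits the overfitting risk noted for target encoding. On real data, a shuffled or stratified fold split and some smoothing toward the global mean would usually be preferable to the modulo split shown here.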