Loss functions are the fundamental optimization objectives that guide neural network training by quantifying the discrepancy between predicted and target outputs. The choice of loss function profoundly impacts model convergence, generalization performance, and the specific behaviors the network learns — different tasks demand different mathematical formulations to properly align gradient signals with the desired outcomes. Beyond standard regression and classification losses, modern deep learning employs specialized losses for metric learning, self-supervised pretraining, imbalanced datasets, multi-task scenarios, and probabilistic modeling. Understanding when to use MSE versus Huber, cross-entropy versus focal loss, or contrastive versus triplet losses — and how to implement custom differentiable objectives — is essential for achieving state-of-the-art results across computer vision, NLP, and other domains.
What This Cheat Sheet Covers
This topic spans 14 focused tables and 91 indexed concepts. Below is a table-by-table outline, from foundational concepts through advanced details.
Table 1: Regression Losses — Basic
The starting point for any continuous-value prediction. The central tension here is how each loss treats large errors: MSE squares them and so chases outliers hard, while MAE penalizes uniformly and stays robust. RMSE just brings MSE back to the target's original units for interpretability, and MSLE and MAPE shift the focus to relative rather than absolute error — handy when your targets span many orders of magnitude.
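The contrast described above can be checked numerically. The following is a minimal sketch in plain Python rather than PyTorch, so it runs without dependencies. The helper names (`mse`, `mae`, `rmse`, `msle`) and the toy data are hypothetical and chosen only for illustration. MSLE is written in its common `log(1 + y)` form, which assumes non-negative values.

```python
import math

def mse(y_pred, y):
    # Mean squared error: squaring makes large errors dominate the average.
    return sum((p - t) ** 2 for p, t in zip(y_pred, y)) / len(y)

def mae(y_pred, y):
    # Mean absolute error: every unit of error costs the same.
    return sum(abs(p - t) for p, t in zip(y_pred, y)) / len(y)

def rmse(y_pred, y):
    # Root of MSE: same units as the target.
    return math.sqrt(mse(y_pred, y))

def msle(y_pred, y):
    # Mean squared log error (assumes values >= 0): compares ratios, not differences.
    return sum((math.log1p(p) - math.log1p(t)) ** 2 for p, t in zip(y_pred, y)) / len(y)

# Hypothetical toy data: identical predictions except for one large miss.
y_true = [1.0, 2.0, 3.0, 4.0]
y_clean = [1.5, 2.5, 3.5, 4.5]     # every error is 0.5
y_outlier = [1.5, 2.5, 3.5, 14.0]  # last error is 10.0

for name, fn in [("MSE", mse), ("MAE", mae), ("RMSE", rmse)]:
    print(f"{name}: clean={fn(y_clean, y_true):.4f}  outlier={fn(y_outlier, y_true):.4f}")

# Relative error: predicting half the target at two very different scales.
print("MSE  small/large:", mse([10.0], [20.0]), mse([1000.0], [2000.0]))
print("MSLE small/large:", msle([10.0], [20.0]), msle([1000.0], [2000.0]))
```

In this toy case the single outlier raises MSE by a factor of about 100 but MAE by less than 6. The MSLE values at the two scales stay close to each other, while the MSE values differ by four orders of magnitude.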
| Loss | Example | Description |
|---|---|---|
| MSE (Mean Squared Error) | `loss = torch.nn.MSELoss()`<br>`L = ((y_pred - y)**2).mean()` | • L2 loss that heavily penalizes large errors due to squaring<br>• sensitive to outliers and commonly used for continuous value prediction |
| MAE (Mean Absolute Error) | `loss = torch.nn.L1Loss()`<br>`L = (y_pred - y).abs().mean()` | • L1 loss providing uniform penalty across all error magnitudes<br>• more robust to outliers than MSE and produces median-like predictions |
| RMSE (Root Mean Squared Error) | `L = torch.sqrt(((y_pred - y)**2).mean())` | • Square root of MSE that returns error in the original scale of the target variable<br>• shares MSE's minimizer because the square root is monotonic, though gradient magnitudes differ; more interpretable than MSE |