Multi-task learning (MTL) trains a single model to solve multiple related tasks simultaneously, leveraging shared representations to improve generalization and sample efficiency across tasks. Multi-label learning tackles problems where each instance can be assigned multiple labels simultaneously (unlike multi-class classification, which assigns exactly one label). Both paradigms share a core insight: explicitly modeling relationships between outputs — whether tasks or labels — improves learning efficiency and prediction accuracy. The key challenge lies in balancing competing objectives: tasks can exhibit positive transfer (helping each other) or negative transfer (hurting performance), while labels can be positively correlated, negatively correlated, or independent. Successful approaches must adapt dynamically to these relationships during training.
What This Cheat Sheet Covers
This topic spans 15 focused tables and 69 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Parameter Sharing Architectures
The whole game of multi-task learning starts with one decision: how much of the network should tasks hold in common versus keep to themselves. These architectures span that spectrum — from a single shared trunk with thin task heads, through learnable mixing of separate columns, to bottleneck adapters that bolt onto a frozen backbone — each trading off parameter cost against the freedom to let tasks diverge when they conflict.
| Architecture | Example | Description |
|---|---|---|
| Hard parameter sharing | `shared_encoder → [task_head_1, task_head_2, ..., task_head_n]` | • Shared bottom layers with task-specific output heads<br>• The risk of overfitting the shared parameters shrinks on the order of N (the number of tasks), but conflicting tasks can cause negative transfer |
| Soft parameter sharing | `encoder_1 ↔ encoder_2 ↔ ... ↔ encoder_n` | • Each task has separate parameters, with regularization encouraging them to stay similar<br>• More flexible than a shared trunk, but the parameter count grows with the number of tasks |
| Cross-stitch networks | `task_A_features = α·A + β·B`<br>`task_B_features = γ·A + δ·B` | • Learnable linear combinations of task-specific features at each layer<br>• The mixing weights are trained with the rest of the network, so the degree of sharing is learned from data rather than fixed by hand |
| Attention-based sharing | `attention_mask_i = σ(Conv(shared_features))`<br>`task_i_features = attention_mask_i ⊙ shared_features` | • Task-specific attention masks applied to shared features<br>• Each task selects the shared features it needs while adding only a small mask module per task |
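
The four patterns in the table can be sketched as forward-pass functions. This is a minimal NumPy illustration, not a reference implementation: the function names, array shapes, and random weights are hypothetical, the L2 distance is only one possible choice of similarity regularizer for soft sharing, and a dense layer stands in for the `Conv` in the attention mask.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hard_sharing_forward(x, W_shared, heads):
    """Hard sharing: one shared encoder feeds one linear head per task."""
    shared = relu(x @ W_shared)  # features reused by every task
    return [shared @ W_head for W_head in heads]

def soft_sharing_penalty(encoders, strength=0.01):
    """Soft sharing: separate encoders, pulled together by an L2 penalty
    (an assumed regularizer; other distance measures are possible)."""
    penalty = 0.0
    for i in range(len(encoders)):
        for j in range(i + 1, len(encoders)):
            penalty += np.sum((encoders[i] - encoders[j]) ** 2)
    return strength * penalty

def cross_stitch(feat_a, feat_b, mix):
    """Cross-stitch unit: mix = [[alpha, beta], [gamma, delta]]."""
    new_a = mix[0, 0] * feat_a + mix[0, 1] * feat_b
    new_b = mix[1, 0] * feat_a + mix[1, 1] * feat_b
    return new_a, new_b

def task_attention(shared, W_mask):
    """Attention sharing: a sigmoid mask gates the shared features.
    A dense layer replaces the convolution to keep the sketch short."""
    mask = sigmoid(shared @ W_mask)  # values in (0, 1)
    return mask * shared             # element-wise product

# Toy data: batch of 4 inputs with 8 features, hidden width 16.
x = rng.normal(size=(4, 8))
W_shared = rng.normal(size=(8, 16))
heads = [rng.normal(size=(16, 3)), rng.normal(size=(16, 1))]
outputs = hard_sharing_forward(x, W_shared, heads)

encoders = [rng.normal(size=(8, 16)) for _ in range(2)]
penalty = soft_sharing_penalty(encoders)

feat_a, feat_b = np.ones((4, 16)), np.zeros((4, 16))
mix = np.array([[0.9, 0.1], [0.1, 0.9]])  # mostly task-specific, a little shared
new_a, new_b = cross_stitch(feat_a, feat_b, mix)

shared = relu(x @ W_shared)
W_mask = rng.normal(size=(16, 16))
attended = task_attention(shared, W_mask)
```

In a trained model, `mix` and `W_mask` would be learned parameters and `penalty` would be added to the sum of task losses. A `mix` close to the identity matrix keeps the two tasks nearly independent, while larger off-diagonal entries share more.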