Federated Learning (FL) is a distributed machine learning paradigm that enables collaborative model training across decentralized devices or servers without centralizing raw data. Introduced by Google researchers in 2016 and deployed early on to improve Gboard's next-word prediction, FL has become foundational for privacy-preserving AI in healthcare, finance, IoT, and edge computing. Unlike traditional centralized learning, where all data is uploaded to a central server, FL keeps data local and shares only model updates. This distinction helps preserve user privacy, supports compliance with regulations such as GDPR and HIPAA, and avoids the cost of transferring raw datasets. The key challenge is reaching global model convergence despite heterogeneous data distributions, unreliable network connections, and resource-constrained devices, which calls for sophisticated aggregation algorithms, communication-efficient protocols, and robust defenses against adversarial attacks.
What This Cheat Sheet Covers
This topic spans 24 focused tables and 148 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Core Aggregation Algorithms
The aggregation algorithm is the heart of federated learning—it's the rule the server uses to fuse hundreds of local model updates into one global model. FedAvg is the simple weighted-average baseline everything else builds on; the rest are direct answers to its weaknesses, adding proximal terms, control variates, or server-side adaptive optimizers to keep training stable when clients hold wildly different data.
| Algorithm | Example | Description |
|---|---|---|
| FedAvg | $w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} w_k^t$ | • Weighted average of locally trained models, with weights proportional to client dataset sizes ($n = \sum_k n_k$) • Forms the baseline for most FL algorithms |
| FedProx | $\min_w F_k(w) + \frac{\mu}{2} \lVert w - w_t \rVert^2$ | • Adds a proximal term, weighted by $\mu$, to client $k$'s local objective $F_k$ to limit client drift in heterogeneous environments • Mitigates the impact of clients whose local optima differ sharply because of non-IID data |
| FedAdam | $m_t = \beta_1 m_{t-1} + (1-\beta_1)\,\Delta w_t$; $v_t = \beta_2 v_{t-1} + (1-\beta_2)\,(\Delta w_t)^2$ | • Applies an Adam-style adaptive update at the server, using first- and second-moment estimates of the aggregated update $\Delta w_t$ • Can converge faster than FedAvg on non-convex objectives |
| SCAFFOLD | $w_k^{t+1} = w_k^t - \eta (g_k - c_k + c)$ | • Maintains a per-client control variate $c_k$ and a global correction $c$ to reduce client drift • Achieves linear speedup with respect to the number of clients |
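
The four update rules in the table above (FedAvg, FedProx, FedAdam, SCAFFOLD) can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes each model is a flat parameter vector, and the function names, argument names, and default hyperparameters (`lr`, `beta1`, `beta2`, `tau`) are hypothetical choices made for this example. The small constant `tau` in the server step is an added numerical-stability term that does not appear in the table.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg: weighted average sum_k (n_k / n) * w_k of client models."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)          # shape (K, d)
    return (sizes / sizes.sum()) @ stacked      # shape (d,)

def fedprox_step(w, w_global, grad, lr, mu):
    """One local gradient step on F_k(w) + (mu / 2) * ||w - w_global||^2.

    `grad` is the gradient of the local loss F_k at w; the proximal term
    contributes the extra gradient mu * (w - w_global).
    """
    return w - lr * (grad + mu * (w - w_global))

def fedadam_server_step(w_global, delta, m, v,
                        lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """Server-side Adam-style step driven by the aggregated client delta.

    Assumed defaults are illustrative only; tau avoids division by zero.
    """
    m = beta1 * m + (1 - beta1) * delta
    v = beta2 * v + (1 - beta2) * delta ** 2
    w_new = w_global + lr * m / (np.sqrt(v) + tau)
    return w_new, m, v

def scaffold_step(w, grad, c_local, c_global, lr):
    """SCAFFOLD local step: w - lr * (g_k - c_k + c)."""
    return w - lr * (grad - c_local + c_global)

# Usage: two clients holding 1 and 3 samples respectively.
clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
w_avg = fedavg(clients, client_sizes=[1, 3])
print(w_avg)  # weights 0.25 and 0.75 give [2.5 3.5]
```

Keeping each rule as a pure function of arrays makes the differences easy to see: FedProx and SCAFFOLD change the client-side step, while FedAvg and FedAdam change what the server does with the results.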