The Mixture of Experts (MoE) architecture scales neural networks by activating only a subset of parameters for each input, which makes trillion-parameter models feasible at a fraction of the compute a dense model of the same size would need. Originally proposed for machine learning ensembles, MoE now underpins state-of-the-art language models such as DeepSeek-V3 and Mixtral, and reportedly GPT-4 (rumored, unconfirmed). The core idea: a gating network dynamically routes each token to specialized expert subnetworks, activating perhaps 10% of total parameters while maintaining performance comparable to dense models 5-10× larger than the active parameter count. The result is a sparse activation pattern in which computational cost scales with active parameters rather than total capacity, a scaling behavior fundamentally different from that of dense transformers, where the two are tied together. Understanding MoE means grasping the tension among expert specialization (routing precision), load balancing (preventing expert collapse), and communication efficiency (all-to-all bottlenecks in distributed training).
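To make the "cost scales with active parameters" point concrete, here is a rough back-of-the-envelope sketch. The function `moe_param_counts` and all configuration values are hypothetical, chosen only for illustration; the count covers FFN experts and routers and ignores attention, embeddings, and biases.

```python
def moe_param_counts(d_model, d_ff, num_experts, top_k, num_layers):
    """Count FFN-side parameters in an MoE stack (attention and biases ignored)."""
    per_expert = 2 * d_model * d_ff      # up-projection + down-projection of a 2-layer MLP
    router = d_model * num_experts       # one linear gate per MoE layer
    total = num_layers * (num_experts * per_expert + router)
    active = num_layers * (top_k * per_expert + router)  # only top_k experts run per token
    return total, active


# Hypothetical configuration, not taken from any specific published model.
total, active = moe_param_counts(
    d_model=4096, d_ff=14336, num_experts=8, top_k=2, num_layers=32
)
print(f"total FFN params: {total / 1e9:.1f}B")
print(f"active per token: {active / 1e9:.1f}B ({active / total:.0%} of total)")
```

With 8 experts and top-2 routing, roughly a quarter of the FFN parameters are used per token; raising `num_experts` while holding `top_k` fixed grows capacity without growing per-token compute.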
What This Cheat Sheet Covers
This topic spans 12 focused tables and 99 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core MoE Architecture Components
MoE models replace standard feed-forward network (FFN) layers in transformers with specialized expert layers controlled by a learned router. Each component serves a distinct architectural role, from input routing to output aggregation, and understanding their interplay is essential for both training and inference optimization.
| Component | Example | Description |
|---|---|---|
| Expert | `experts = nn.ModuleList([nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)) for _ in range(N)])` | Specialized feed-forward subnetwork processing a subset of inputs; typically a 2-layer MLP identical in structure to a standard transformer FFN but trained to specialize via routing. |
| Router (gating network) | `router_logits = W_gate @ token_embed`<br>`router_probs = torch.softmax(router_logits, dim=-1)` | Learned neural network (usually a single linear layer + softmax) that computes a routing score for each token-expert pair; determines which experts process each input. |
| Top-k routing | `top_k_probs, top_k_indices = torch.topk(router_probs, k=2)`<br>`selected_experts = [experts[i] for i in top_k_indices.tolist()]` | Routing strategy selecting the k highest-scoring experts per token; typical values are k=1 (Switch Transformer) or k=2 (Mixtral), balancing specialization against robustness. |
| Expert capacity | `capacity = math.ceil(batch_size * seq_len / num_experts * capacity_factor)` | Maximum number of tokens an expert can process per batch; bounds per-expert memory and compute, but tokens routed beyond the limit are dropped. Capacity is typically set to 1.25–2.0× the average load via `capacity_factor`. |
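The components in the table can be tied together in a small, framework-free sketch of routing with a capacity limit. This is an illustration under stated assumptions, not how any particular library implements it: `route_tokens` and `softmax` are hypothetical helpers, tokens are processed in index order (so later tokens are the ones dropped), gate weights are renormalized over the selected experts, and capacity follows the table's formula rounded up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def route_tokens(router_logits, top_k=2, capacity_factor=1.25):
    """Assign each token to its top_k experts, dropping assignments over capacity.

    router_logits: one list of per-expert logits for each token.
    Returns (assignments, dropped, capacity); assignments[e] holds
    (token_index, gate_weight) pairs accepted by expert e.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    # Capacity formula from the table, rounded up to a whole number of tokens.
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)
    assignments = [[] for _ in range(num_experts)]
    dropped = []
    for t, logits in enumerate(router_logits):
        probs = softmax(logits)
        # Indices of the top_k highest-probability experts for this token.
        top = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:top_k]
        norm = sum(probs[e] for e in top)  # renormalize gates over chosen experts
        for e in top:
            if len(assignments[e]) < capacity:
                assignments[e].append((t, probs[e] / norm))
            else:
                dropped.append((t, e))  # expert is full: this assignment is dropped
    return assignments, dropped, capacity


# Four tokens that all prefer expert 0, with top-1 routing and no slack.
logits = [[2.0, 0.0]] * 4
assignments, dropped, capacity = route_tokens(logits, top_k=1, capacity_factor=1.0)
print("capacity per expert:", capacity)
print("expert 0 tokens:", [t for t, _ in assignments[0]])
print("dropped (token, expert):", dropped)
```

In this deliberately unbalanced example, expert 0 fills its two slots and the remaining two tokens are dropped while expert 1 sits idle, which is the failure mode that load balancing and a larger `capacity_factor` are meant to mitigate.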