Mixture of Experts (MoE) Architecture Cheat Sheet

Updated 2026-05-18

Mixture of Experts (MoE) architectures scale neural networks by conditionally activating only a subset of parameters per input, enabling trillion-parameter models at a fraction of the computational cost of dense models of the same size. Originally proposed for machine learning ensembles, modern MoE has become the backbone of state-of-the-art language models such as DeepSeek-V3, Mixtral, and (reportedly) GPT-4. The core innovation: a gating network dynamically routes each token to a few specialized expert subnetworks, activating perhaps 10% of total parameters while maintaining performance comparable to models 5-10× larger. The result is a sparse activation pattern in which computational cost scales with active parameters rather than total capacity, a fundamentally different scaling regime from dense transformers. Understanding MoE means grasping the tension between expert specialization (routing precision), load balancing (preventing expert collapse), and communication efficiency (all-to-all bottlenecks in distributed training).
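The gap between total and active parameters can be made concrete with a little arithmetic. The sketch below counts only the expert FFN weights of a single MoE layer and ignores attention, embeddings, router weights, and biases. The function name `moe_ffn_params` and all dimensions are hypothetical values chosen for illustration, not the configuration of any model named above.

```python
def moe_ffn_params(d_model: int, d_ff: int, num_experts: int, top_k: int):
    """Return (total, active) weight counts for one MoE FFN layer.

    Assumes each expert is a 2-layer MLP (d_model -> d_ff -> d_model)
    without biases; router weights are ignored.
    """
    per_expert = 2 * d_model * d_ff
    total = num_experts * per_expert
    active = top_k * per_expert  # only the selected experts run for a token
    return total, active


# Hypothetical configuration, for illustration only.
total, active = moe_ffn_params(d_model=4096, d_ff=16384, num_experts=64, top_k=2)
print(f"total expert params: {total:,}")
print(f"active per token:    {active:,}")
print(f"active fraction:     {active / total:.1%}")
```

Under these assumptions the active fraction of expert weights is simply `top_k / num_experts`, which is why adding experts grows capacity without growing per-token compute.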

What This Cheat Sheet Covers

This topic spans 12 focused tables and 99 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core MoE Architecture Components
Table 2: Routing Mechanisms and Strategies
Table 3: Load Balancing and Auxiliary Loss Functions
Table 4: Training Stability and Optimization
Table 5: Communication and Distributed Training
Table 6: Notable MoE Model Architectures
Table 7: Sparse vs Dense Model Comparison
Table 8: MoE for Multi-Task and Domain Adaptation
Table 9: Inference Optimization and Deployment
Table 10: Advanced Routing and Gating Variants
Table 11: MoE Compression and Efficiency Techniques
Table 12: Debugging and Monitoring MoE Training

Table 1: Core MoE Architecture Components

MoE models replace standard feed-forward network (FFN) layers in transformers with specialized expert layers controlled by a learned router. Each component serves a distinct architectural role, from input routing to output aggregation, and understanding their interplay is essential for both training and inference optimization.

Component	Example	Description
Expert network	`experts = nn.ModuleList([nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)) for _ in range(N)])`	• Specialized feed-forward subnetwork processing a subset of inputs • typically a 2-layer MLP identical in structure to standard transformer FFNs but trained to specialize via routing (the GELU activation in the example is an illustrative choice)
Gating network (router)	`router_logits = W_gate @ token_embed` `router_probs = torch.softmax(router_logits, dim=-1)`	• Learned neural network (usually a single linear layer + softmax) that computes a routing score for each token-expert pair • determines which experts process each input
Top-k routing	`top_k_probs, top_k_indices = torch.topk(router_probs, k=2)` `selected_experts = [experts[i] for i in top_k_indices.tolist()]`	• Routing strategy selecting the k highest-scoring experts per token (example shown for a single token) • typical values are k=1 (Switch Transformer) or k=2 (Mixtral), balancing specialization vs robustness
Expert capacity	`capacity = int(capacity_factor * batch_size * seq_len / num_experts)`	• Maximum number of tokens an expert can process per batch • prevents memory overflow but drops tokens once exceeded • `capacity_factor` is typically chosen so that capacity is 1.25–2.0× the average per-expert load
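The components in Table 1 can be tied together in a small, dependency-free sketch: softmax gating, top-k selection, and a capacity limit that drops overflow. This is a toy illustration in plain Python rather than a reference implementation. The helper names `softmax` and `route` are hypothetical, and the random logits stand in for `W_gate @ token_embed`. Two choices here are assumptions on top of the table: the gate weights are renormalized over the selected experts, and the capacity is scaled by k because each token occupies k slots.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def route(router_logits, k, capacity):
    """Assign each token to its top-k experts, dropping overflow.

    router_logits: one list of per-expert scores per token.
    Returns (assignments, dropped): assignments[e] holds
    (token_index, gate_weight) pairs for expert e; dropped holds the
    (token_index, expert_index) pairs that exceeded capacity.
    """
    num_experts = len(router_logits[0])
    assignments = [[] for _ in range(num_experts)]
    dropped = []
    for t, logits in enumerate(router_logits):
        probs = softmax(logits)
        top = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:k]
        norm = sum(probs[e] for e in top)  # assumption: renormalize over top-k
        for e in top:
            if len(assignments[e]) < capacity:
                assignments[e].append((t, probs[e] / norm))
            else:
                dropped.append((t, e))  # expert full: this token-expert pair is dropped
    return assignments, dropped


num_tokens, num_experts, k, capacity_factor = 16, 4, 2, 1.25
# Table formula, scaled by k (an assumption) since each token takes k slots.
capacity = math.ceil(k * num_tokens / num_experts * capacity_factor)

random.seed(0)
# Random stand-in for router logits; a real router would compute these.
logits = [[random.gauss(0, 1) for _ in range(num_experts)] for _ in range(num_tokens)]
assignments, dropped = route(logits, k, capacity)

for e, slots in enumerate(assignments):
    print(f"expert {e}: {len(slots)}/{capacity} slots used")
print(f"dropped (token, expert) pairs: {len(dropped)}")
```

Lowering `capacity_factor` or skewing the logits toward one expert makes the dropped count rise, which is the token-dropping behavior the table describes.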
