Rate limiting and throttling are traffic control mechanisms in distributed systems and APIs that regulate how frequently clients can make requests within a specified timeframe. These patterns protect backend infrastructure from overload, prevent abuse, enforce usage quotas for tiered pricing models, and ensure fair resource allocation across users. The key distinction is that rate limiting rejects requests once a threshold is exceeded (returning HTTP 429), while throttling slows down or queues excess requests to smooth traffic flow, though in practice many practitioners use the terms interchangeably. Building resilient, scalable APIs requires understanding three things: the algorithmic foundations (token bucket, leaky bucket, sliding window), the architectural considerations (distributed Redis-backed counters, gossip sync), and the protocol-specific nuances (GraphQL complexity analysis, LLM token-per-minute budgets, WebSocket connection limits). Together these let a service withstand traffic spikes, DDoS attempts, and noisy neighbors without degrading service for legitimate users.
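To make the reject-versus-delay distinction concrete, here is a minimal sketch. It assumes a deliberately simple policy (at most one request per `min_interval` seconds) and uses hypothetical helper names `rate_limit` and `throttle`. Real limiters use the algorithms covered in Table 1 instead.

```python
def rate_limit(arrivals, min_interval):
    """Reject excess requests: return (arrival_time, http_status) pairs."""
    results = []
    last_accepted = None
    for t in arrivals:
        if last_accepted is None or t - last_accepted >= min_interval:
            results.append((t, 200))   # within budget: serve it
            last_accepted = t
        else:
            results.append((t, 429))   # over budget: reject immediately
    return results


def throttle(arrivals, min_interval):
    """Delay excess requests: return (arrival_time, service_time) pairs."""
    served = []
    next_free = float("-inf")          # earliest time the next request may run
    for t in arrivals:
        start = max(t, next_free)      # wait in the queue if arriving too early
        served.append((t, start))
        next_free = start + min_interval
    return served


arrivals = [0.0, 0.1, 0.2, 1.5]
print(rate_limit(arrivals, 1.0))  # [(0.0, 200), (0.1, 429), (0.2, 429), (1.5, 200)]
print(throttle(arrivals, 1.0))    # [(0.0, 0.0), (0.1, 1.0), (0.2, 2.0), (1.5, 3.0)]
```

With the same burst of four requests, the rate limiter drops two of them, while the throttler serves all four but stretches them out over three seconds.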
What This Cheat Sheet Covers
This topic spans 19 focused tables and 113 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Rate Limiting Algorithms
The five algorithms below form the theoretical foundation of every rate limiter in production. Each trades off memory usage, accuracy, burst tolerance, and implementation complexity differently — choosing the wrong one for your traffic shape is a common source of subtle overload bugs.
| Algorithm | Example | Description |
|---|---|---|
| Token Bucket | `bucket_size = 100`<br>`tokens = 100`<br>`refill_rate = 10/sec`<br>`if tokens > 0: allow()` | • Bucket holds tokens that refill at a constant rate<br>• Allows bursts up to the bucket size while maintaining the long-term average rate<br>• Most popular choice for production APIs; used by AWS API Gateway and Stripe |
| Sliding Window Counter | `prev_count * overlap + curr_count`<br>`rate = (80*0.5 + 30) / 100` | • Hybrid approach: uses counts from the previous and current fixed windows with a weighted overlap calculation<br>• Approximates the sliding window log with far less memory<br>• Best balance of accuracy and efficiency; recommended for most distributed APIs |
| Sliding Window Log | `timestamps = [t1, t2, ...]`<br>`now = time.now()`<br>`valid = filter(t > now-60s)` | • Stores a timestamp for every request in a rolling window<br>• Precise but memory-intensive for high traffic<br>• No sudden burst at window boundaries; use when accuracy is paramount |
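The pseudocode in the rows above (token bucket, sliding window counter, sliding window log) can be expanded into a runnable, single-process sketch. The class names, the `allow(now)` signature, and the explicit `now` argument are assumptions made here so the example is deterministic. A production limiter would read a monotonic clock, guard state with a lock, and typically keep counters in a shared store.

```python
from collections import deque


class TokenBucket:
    """Refill at a constant rate; each request spends one token."""

    def __init__(self, capacity, refill_rate, now=0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens added per second
        self.tokens = capacity           # start full, so an initial burst passes
        self.last = now

    def allow(self, now):
        # Credit tokens for elapsed time, capped at the bucket size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = max(self.last, now)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class SlidingWindowCounter:
    """Estimate the rolling count as prev_count * overlap + curr_count."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.curr_start = 0.0
        self.curr_count = 0
        self.prev_count = 0

    def allow(self, now):
        # Assumes `now` never moves backwards.
        start = (now // self.window) * self.window
        if start != self.curr_start:
            # Roll over; the old count only matters if its window is adjacent.
            adjacent = (start - self.curr_start) == self.window
            self.prev_count = self.curr_count if adjacent else 0
            self.curr_start, self.curr_count = start, 0
        # Fraction of the previous window still inside the rolling window.
        overlap = 1.0 - (now - start) / self.window
        estimated = self.prev_count * overlap + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False


class SlidingWindowLog:
    """Keep one timestamp per accepted request inside the rolling window."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.timestamps = deque()

    def allow(self, now):
        # Drop timestamps that have aged out (keep only t > now - window).
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False


bucket = TokenBucket(capacity=3, refill_rate=1.0)
print([bucket.allow(0.0) for _ in range(4)])  # [True, True, True, False]
print([bucket.allow(2.0) for _ in range(3)])  # [True, True, False]
```

The memory trade-off from the table shows up directly in the state each class keeps: the token bucket and the window counter hold a fixed handful of numbers per client, while the log grows with the number of accepted requests in the window.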