Speculative decoding and LLM serving optimization are key techniques for accelerating inference in production language models. As models grow to hundreds of billions of parameters, memory bandwidth rather than compute throughput becomes the primary constraint during token generation. Modern serving systems address this through a combination of architectural innovations (PagedAttention, Flash Attention), algorithmic techniques (speculative decoding, continuous batching), and hardware-aware optimizations (quantization, parallelism). The key insight is that decode is memory-bound while prefill is compute-bound, so the two phases call for fundamentally different optimization strategies. Systems that blend both phases intelligently (chunked prefills, disaggregated serving) achieve 2-5× throughput gains over naive implementations.
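The memory-bound versus compute-bound split can be illustrated with a rough back-of-the-envelope estimate. The sketch below is not a benchmark: the model size, bytes per parameter, memory bandwidth, and peak FLOP/s are all hypothetical round numbers, and the helper `phase_times` is invented for illustration. It ignores KV-cache traffic, attention cost, and batching.

```python
def phase_times(n_params, n_tokens, bytes_per_param=2,
                mem_bw=2e12, peak_flops=1e15):
    """Rough lower bounds (seconds) for one forward pass over n_tokens.

    All defaults are hypothetical: 16-bit weights, 2 TB/s memory
    bandwidth, 1 PFLOP/s peak compute.
    """
    # Every weight is read from memory once per forward pass.
    memory_time = n_params * bytes_per_param / mem_bw
    # Roughly 2 FLOPs (multiply + add) per parameter per token.
    compute_time = 2 * n_params * n_tokens / peak_flops
    return memory_time, compute_time


# Hypothetical 70B-parameter model.
decode_mem, decode_cmp = phase_times(70e9, n_tokens=1)       # one decode step
prefill_mem, prefill_cmp = phase_times(70e9, n_tokens=2048)  # one prefill pass

print(f"decode : memory {decode_mem:.4f}s vs compute {decode_cmp:.5f}s")
print(f"prefill: memory {prefill_mem:.4f}s vs compute {prefill_cmp:.4f}s")
```

Under these assumptions a single decode step is dominated by reading the weights, while a 2048-token prefill is dominated by arithmetic, which is why the two phases respond to different optimizations.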
What This Cheat Sheet Covers
This cheat sheet spans 14 focused tables and 75 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Speculative Decoding Fundamentals
Speculative decoding accelerates LLM inference by predicting multiple future tokens with a fast draft model, then verifying them in parallel with the target model. Because the target model can score several positions in a single forward pass, much as it does during prefill, the cost of sequential decode is amortized across multiple tokens. This yields 2-3× speedups while preserving the target model's output distribution (with greedy decoding, the outputs are identical).
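The accept/reject step can be sketched with the standard speculative sampling rule: a draft token is accepted with probability min(1, p/q), where p and q are the target and draft probabilities for that token, and a rejected token is replaced by a sample from the normalized residual max(0, p - q). The code below is a toy illustration, not a serving implementation: the function names `verify_draft` and `_sample` are hypothetical, and distributions are plain dictionaries rather than model logits.

```python
import random


def _sample(weights, rng):
    """Draw one token from a dict of non-negative (unnormalized) weights."""
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens])[0]


def verify_draft(draft_tokens, draft_probs, target_probs, rng=random):
    """Verify k draft tokens against the target model's distributions.

    draft_probs[i]  : dict token -> draft probability at position i
    target_probs[i] : dict token -> target probability at position i;
                      holds k + 1 entries, the last one for a bonus token.
    Returns the tokens emitted in this step (between 1 and k + 1).
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        q, p = draft_probs[i], target_probs[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            emitted.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        emitted.append(_sample(residual, rng))
        return emitted
    # Every draft token was accepted: take one bonus token from the target.
    emitted.append(_sample(target_probs[len(draft_tokens)], rng))
    return emitted


# Toy usage: the draft and target agree exactly, so both draft tokens pass.
uniform = {"a": 0.5, "b": 0.5}
print(verify_draft(["a", "b"], [uniform, uniform], [uniform] * 3))
```

In a real system the k + 1 target distributions come from a single forward pass over the draft tokens, which is where the speedup originates.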
| Technique | Example | Description |
|---|---|---|
| Draft-target verification | `draft_model.generate(k=5)`; `target_model.verify(draft_tokens)` | • Smaller, faster draft model proposes k tokens • Larger target model verifies all proposals in one parallel forward pass • Each token is accepted or rejected by comparing target and draft probabilities |
| Acceptance rate | `acceptance_rate = 0.75` | • Fraction of draft tokens accepted • Depends on draft-target alignment • Rates above 60% yield significant speedup • Rates below 40% add net overhead |
| N-gram drafting | `draft_tokens = ngram_match(context, n=4)` | • Matches n-grams against the existing context (prompt and generated tokens) instead of running a separate draft model • Near-zero-cost draft generation • Effective for repetitive text patterns |
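The acceptance-rate and n-gram rows can be made concrete with two small helpers. Both are sketches under stated assumptions: `expected_tokens_per_step` assumes each draft token is accepted independently with the same probability `alpha`, which real workloads only approximate, and it ignores the cost of drafting. `ngram_match` is one plausible reading of the table's example, operating on a plain list of tokens, with a `k` parameter added here for illustration.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per target forward pass.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha; the pass always yields at least one token.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def ngram_match(context, n=4, k=5):
    """Propose up to k draft tokens without a draft model.

    Finds the most recent earlier occurrence of the last n tokens of
    `context` and returns the tokens that followed it.
    """
    if len(context) <= n:
        return []
    tail = context[-n:]
    # Scan right to left, skipping the tail's own position.
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == tail:
            return context[start + n:start + n + k]
    return []


print(expected_tokens_per_step(0.75, 5))

tokens = "the cat sat on the mat . the cat sat on".split()
print(ngram_match(tokens, n=4, k=3))
```

The first helper shows why acceptance rate matters so much: the expected yield per target pass grows quickly as `alpha` approaches 1 and collapses toward a single token as it falls. The second shows why n-gram drafting suits repetitive text, since it can only propose continuations that already appeared in the context.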