Speculative decoding and related serving optimizations are central techniques for accelerating inference in production language models. As models grow to hundreds of billions of parameters, memory bandwidth, not compute throughput, becomes the primary constraint during token generation. Modern serving systems address this through a combination of architectural innovations (PagedAttention, FlashAttention), algorithmic techniques (speculative decoding, continuous batching), and hardware-aware optimizations (quantization, parallelism). The key insight: decode is memory-bound while prefill is compute-bound, so the two phases call for fundamentally different optimization strategies. Systems that blend both phases intelligently (chunked prefills, disaggregated serving) can achieve 2-5× throughput gains over naive implementations.
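The memory-bound versus compute-bound split can be made concrete with a rough arithmetic-intensity estimate. The sketch below is a simplification rather than a profiler: it assumes about 2 FLOPs per parameter per token, that every weight is read once per forward pass, and it ignores KV-cache and activation traffic. The function names and the hardware figures are hypothetical and do not describe any specific device.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs performed per byte of weights read in one forward pass.

    Assumptions (simplifications, not measurements): about 2 FLOPs per
    parameter per token, every weight read from memory once per pass, and
    KV-cache and activation traffic ignored.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def is_memory_bound(tokens_per_pass, peak_flops, peak_bandwidth, bytes_per_param=2.0):
    """True when the pass sits below the roofline ridge point (FLOPs per byte)."""
    ridge = peak_flops / peak_bandwidth
    return arithmetic_intensity(tokens_per_pass, bytes_per_param) < ridge


# Hypothetical accelerator: 1e15 FLOP/s peak compute, 2e12 B/s memory bandwidth.
PEAK_FLOPS, PEAK_BW = 1e15, 2e12

decode_bound = is_memory_bound(1, PEAK_FLOPS, PEAK_BW)      # one new token per pass
prefill_bound = is_memory_bound(2048, PEAK_FLOPS, PEAK_BW)  # 2048-token prompt
print(decode_bound, prefill_bound)  # prints: True False
```

Under these assumptions, anything that raises the number of tokens processed per weight read (larger batches, or verifying several drafted tokens at once) moves decode toward the compute-bound regime.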
What This Cheat Sheet Covers
This topic spans 14 focused tables and 75 indexed concepts. The outline below goes table by table, from foundational concepts through advanced details.
Table 1: Speculative Decoding Fundamentals
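Speculative decoding accelerates LLM inference by predicting multiple future tokens with a fast draft model, then verifying them in parallel with the target model. Verification scores all drafted positions in a single target forward pass, much like prefill, so the cost of reading the target's weights is amortized over several tokens instead of one. Typical speedups are 2-3×, and with the standard accept/reject rule the output distribution is identical to that of the target model alone.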
| Technique | Example | Description |
|---|---|---|
| Draft-target verification | `draft_model.generate(k=5)`, then `target_model.verify(draft_tokens)` | Smaller, faster draft model proposes k tokens; larger target model verifies all proposals in one parallel forward pass; each token is accepted or rejected by comparing the draft and target probability distributions |
| Acceptance rate | `acceptance_rate = 0.75` | Fraction of draft tokens accepted; depends on draft-target alignment; rates above roughly 60% typically yield a significant speedup; below roughly 40%, drafting can add more overhead than it saves |
| N-gram drafting | `draft_tokens = ngram_match(context, n=4)` | Matches the most recent n-gram against earlier tokens in the context instead of running a separate draft model; near-zero-cost draft generation; effective for repetitive text patterns |
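
The three ideas in the table (draft-then-verify, acceptance rate, n-gram drafting) can be sketched on a toy vocabulary. This is a minimal illustration, not any serving framework's API: `speculative_step`, `expected_tokens_per_pass`, and `ngram_draft` are hypothetical names, probabilities are passed in as plain lists instead of coming from real models, and the expected-token formula assumes each draft token is accepted independently with the same probability.

```python
import random


def speculative_step(draft_probs, target_probs, draft_tokens, rng):
    """Verify k drafted tokens and return the tokens emitted this step.

    draft_probs[i] and target_probs[i] are distributions over the vocabulary
    at drafted position i; target_probs has one extra entry for the position
    after the last draft token. Assumes draft_probs[i][tok] > 0 for each
    drafted token, which holds when the token was sampled from the draft.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        emitted.append(rng.choices(range(len(p)), weights=residual)[0])
        return emitted
    # Every draft token accepted: sample one extra token from the target.
    bonus = target_probs[len(draft_tokens)]
    emitted.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return emitted


def expected_tokens_per_pass(alpha, k):
    """Expected tokens per target pass, assuming i.i.d. acceptance rate alpha."""
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def ngram_draft(context, n, k):
    """Propose up to k tokens that followed the latest earlier match of the last n tokens."""
    if len(context) <= n:
        return []
    suffix = context[-n:]
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == suffix:
            return context[start + n:start + n + k]
    return []


rng = random.Random(0)
same = [[0.5, 0.5]] * 3
accepted_all = speculative_step(same[:2], same, [0, 1], rng)  # drafts always accepted
rejected = speculative_step([[1.0, 0.0]], [[0.0, 1.0], [0.5, 0.5]], [0], rng)  # always rejected
tokens_per_pass = expected_tokens_per_pass(0.75, 5)  # about 3.29
proposal = ngram_draft([1, 2, 3, 4, 5, 1, 2, 3], n=3, k=2)  # [4, 5]
```

With the table's example values (acceptance rate 0.75, k=5), this simplified model gives about 3.29 tokens per target forward pass. The realized speedup is lower once the cost of running the draft model is included.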