Speculative Decoding and LLM Serving Optimization Cheat Sheet

Updated 2026-05-18

Speculative decoding and LLM serving optimization are core techniques for accelerating inference in production language models. As models grow to hundreds of billions of parameters, memory bandwidth during token generation, rather than compute throughput, becomes the primary constraint. Modern serving systems address this through a combination of architectural innovations (PagedAttention, FlashAttention), algorithmic techniques (speculative decoding, continuous batching), and hardware-aware optimizations (quantization, parallelism). The key insight is that decode is memory-bound while prefill is compute-bound, so the two phases require fundamentally different optimization strategies. Systems that blend both phases intelligently (chunked prefills, disaggregated serving) can achieve 2-5× throughput gains over naive implementations.
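
The memory-bound versus compute-bound split can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: the 70B-parameter model, 16-bit weights, and the accelerator figures (1,000 TFLOP/s, 3 TB/s) are assumptions rather than the specs of any particular GPU, the function names are hypothetical, and KV-cache traffic and attention FLOPs are ignored.

```python
# Illustrative assumptions, not measurements of a real device or model.
N_PARAMS = 70e9        # dense model size (parameters)
PEAK_FLOPS = 1.0e15    # accelerator peak compute, FLOP/s
MEM_BW = 3.0e12        # accelerator memory bandwidth, bytes/s


def arithmetic_intensity(tokens_per_pass: int, n_params: float,
                         bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Approximation: about 2 FLOPs per parameter per token, and every
    weight is read from memory once per pass.
    """
    flops = 2.0 * n_params * tokens_per_pass
    bytes_moved = n_params * bytes_per_param
    return flops / bytes_moved


def regime(tokens_per_pass: int, n_params: float,
           peak_flops: float, mem_bw: float) -> str:
    """Classify a forward pass by comparing intensity to the roofline ridge."""
    ridge = peak_flops / mem_bw  # FLOPs/byte needed to saturate compute
    ai = arithmetic_intensity(tokens_per_pass, n_params)
    return "compute-bound" if ai >= ridge else "memory-bound"


if __name__ == "__main__":
    # 1 token per pass ~ single-request decode; 2048 ~ a long prefill.
    for tokens in (1, 64, 2048):
        print(tokens, regime(tokens, N_PARAMS, PEAK_FLOPS, MEM_BW))
```

Under these assumptions, raising the number of tokens processed per pass is what moves a workload toward the compute-bound side, which is the same lever that batching and speculative verification pull.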

What This Cheat Sheet Covers

This topic spans 14 focused tables and 75 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Speculative Decoding Fundamentals
Table 2: Draft Model Selection
Table 3: KV Cache Management Strategies
Table 4: KV Cache Compression Techniques
Table 5: Memory-Efficient Attention Algorithms
Table 6: Batching Strategies for LLM Serving
Table 7: Serving Framework Comparison
Table 8: Quantization Methods for Production Serving
Table 9: Parallelism Strategies for Multi-GPU Inference
Table 10: Attention Architecture Variants
Table 11: Prefill-Decode Optimization Techniques
Table 12: Request Scheduling and Load Balancing
Table 13: Production Deployment Optimizations
Table 14: Performance Metrics and Benchmarking

Table 1: Speculative Decoding Fundamentals

Speculative decoding accelerates LLM inference by proposing multiple future tokens with a fast draft model, then verifying them in parallel with the target model. Verification scores all proposed tokens in a single forward pass, much like prefill, which amortizes the cost of sequential decode. Typical speedups are 2-3×, and the output distribution is identical to that of the target model running alone.
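
The draft-then-verify loop described above can be sketched with toy models. This is a minimal illustration under stated assumptions, not any framework's implementation: it covers greedy decoding only (sampling requires a rejection-sampling acceptance rule, not shown), the "models" are hypothetical functions over integer tokens, and the target's single parallel verification pass is simulated with a Python loop.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify step and return the newly produced tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target picks its own token after every proposed prefix.
    #    A real system gets all k+1 choices from ONE batched forward pass.
    target_choice = [target(context + proposed[:i]) for i in range(k + 1)]

    # 3. Keep the longest prefix on which draft and target agree, then
    #    append the target's token at the first disagreement (or a bonus
    #    token if every proposal was accepted).
    accepted: List[int] = []
    for i in range(k):
        if proposed[i] != target_choice[i]:
            break
        accepted.append(proposed[i])
    accepted.append(target_choice[len(accepted)])
    return accepted


def generate(target: Model, draft: Model, prompt: List[int],
             max_new: int, k: int) -> List[int]:
    """Generate max_new tokens; each step yields between 1 and k+1 tokens."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + max_new]


def toy_target(ctx: List[int]) -> int:
    """Hypothetical target: a fixed deterministic rule on the last token."""
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical draft: agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else toy_target(ctx)


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [1], max_new=10, k=3))
```

Because every emitted token is either confirmed or supplied by the target, the result matches plain greedy decoding with the target alone; the draft only changes how many target passes are needed.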

Technique	Example	Description
Draft-Target Speculative Decoding	`draft_model.generate(k=5)` `target_model.verify(draft_tokens)`	• Smaller, faster draft model proposes k tokens • larger target model verifies all proposals in one parallel forward pass • accept/reject based on probability distribution matching
Acceptance Rate	`acceptance_rate = 0.75`	• Fraction of draft tokens accepted • depends on draft-target alignment • rates above 60% yield significant speedup • below 40% adds overhead
N-Gram Speculation	`draft_tokens = ngram_match(context, n=4)`	• Matches the most recent n tokens against earlier tokens in the context (prompt plus generated text) instead of running a separate draft model • near-zero-cost draft generation • effective for repetitive text patterns
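
Two rows of the table lend themselves to short sketches. The first function estimates how many tokens one verification step yields for a given acceptance rate, under the simplifying assumption that each draft token is accepted independently with the same probability (a geometric-series estimate, not a measured result). The second is one possible reading of the table's `ngram_match` call; its signature and the extra `k` parameter are hypothetical, not taken from a specific library.

```python
from typing import List


def expected_tokens_per_step(acceptance_rate: float, k: int) -> float:
    """Expected tokens produced per target forward pass.

    Assumes each of the k draft tokens is accepted independently with
    probability acceptance_rate; the step always yields at least one
    token because the target contributes its own token at the end.
    """
    if acceptance_rate >= 1.0:
        return float(k + 1)
    return (1.0 - acceptance_rate ** (k + 1)) / (1.0 - acceptance_rate)


def ngram_match(context: List[int], n: int = 4, k: int = 5) -> List[int]:
    """Propose up to k draft tokens by n-gram lookup in the context.

    Finds the most recent earlier occurrence of the last n tokens and
    returns the tokens that followed it. Returns [] when nothing matches.
    """
    if len(context) <= n:
        return []
    suffix = context[-n:]
    # Scan candidate start positions from newest to oldest, skipping the
    # position of the suffix itself.
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == suffix:
            return context[start + n:start + n + k]
    return []


if __name__ == "__main__":
    print(expected_tokens_per_step(0.75, k=5))
    print(ngram_match([1, 2, 3, 4, 5, 9, 1, 2, 3], n=3, k=2))
```

The estimate ignores the draft model's own cost, so actual speedup is lower than the token count alone suggests; with n-gram lookup that draft cost is close to zero, but proposals only land when the text repeats itself.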
