SGLang (LLM Inference Engine) Cheat Sheet

Updated 2026-05-21


SGLang is an open-source, high-performance serving framework for large language models and multimodal models, developed at UC Berkeley and stewarded by LMSYS. It targets a core bottleneck in LLM inference, redundant computation when requests share common prefixes, through RadixAttention: a radix-tree KV cache that automatically identifies and reuses overlapping token sequences across concurrent requests. Beyond caching, SGLang co-designs its two halves: a Python-embedded frontend DSL for expressing multi-step programs (fork/join, constrained generation, conditional branching) is tightly coupled to a backend runtime that turns those programs into efficiently batched GPU work. The key point is that structured outputs, prefix reuse, and parallelism are not independent features bolted together but parts of one system, and that co-design is what lets SGLang report up to 6.4× higher throughput than vLLM on prefix-heavy and agentic workloads.

What This Cheat Sheet Covers

This cheat sheet spans 15 focused tables and 123 indexed concepts. The table-by-table outline below runs from foundational concepts to advanced details.

Table 1: Core Runtime Architecture
Table 2: Structured Output Generation
Table 3: Frontend Python DSL
Table 4: OpenAI-Compatible HTTP Server & API Endpoints
Table 5: Sampling Parameters
Table 6: Parallelism Strategies
Table 7: Speculative Decoding
Table 8: Quantization
Table 9: Multi-LoRA Serving
Table 10: Prefill-Decode Disaggregation
Table 11: Deployment (Installation & Server Launch)
Table 12: Key Server Arguments
Table 13: Model Gateway (Router) & Load Balancing
Table 14: Multimodal & Diffusion Support
Table 15: Performance Benchmarks & Production Numbers

Table 1: Core Runtime Architecture

RadixAttention and the zero-overhead scheduler are the two foundational innovations distinguishing SGLang from other LLM runtimes. Together they ensure that neither KV computation nor GPU time is wasted between requests.

Technique	Example	Description
RadixAttention	`python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct`	• Stores KV cache tensors in a radix tree (trie) keyed by token sequences • new requests walk the tree to find the longest matching prefix and reuse its KV cache without recomputation • enabled by default, so no flag is needed; `--disable-radix-cache` turns it off
LRU eviction policy	Tree fills GPU memory → least-recently-used subtrees evicted first	• When radix tree nodes fill VRAM, SGLang evicts nodes by recency • leaf nodes are evicted before shared interior nodes to preserve the most broadly reusable prefixes
Cache-aware scheduling	Incoming request matched to cached prefix → dispatched ahead of others with no prefix match	• Scheduler prioritizes waiting requests with the longest cached prefix match • reported to reach ~96% of the optimal cache hit rate
Zero-overhead CPU scheduler	GPU processes batch N → CPU prepares batch N+1 in parallel	• Dedicated scheduler process overlaps batch preparation with GPU execution, eliminating idle GPU time • speedup is most significant on small models and large TP configurations
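
The first three rows of the table above can be illustrated with a toy model. The sketch below is not SGLang's implementation: `ToyPrefixCache` and `cache_aware_order` are hypothetical names, each node holds a single token rather than a compressed run of tokens, capacity is counted in tokens instead of GPU memory, and there is no reference counting to protect in-flight requests. It only shows the shape of the idea: walk the tree for the longest cached prefix, evict least-recently-used leaves first, and order waiting requests by how much of their prefix is already cached.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import count

_clock = count()  # logical clock standing in for real access timestamps


@dataclass(eq=False)
class Node:
    parent: Node | None = None
    token: int | None = None
    children: dict[int, Node] = field(default_factory=dict)
    last_used: int = 0


class ToyPrefixCache:
    """Hypothetical, simplified prefix cache: one token per tree node."""

    def __init__(self, capacity: int):
        self.root = Node()
        self.capacity = capacity  # max cached tokens (stand-in for the KV memory budget)
        self.size = 0

    def match_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens are already cached, refreshing their recency."""
        node, matched, now = self.root, 0, next(_clock)
        for t in tokens:
            child = node.children.get(t)
            if child is None:
                break
            child.last_used = now
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Cache a token sequence, sharing nodes with any existing prefix."""
        node, now = self.root, next(_clock)
        for t in tokens:
            child = node.children.get(t)
            if child is None:
                child = Node(parent=node, token=t)
                node.children[t] = child
                self.size += 1
            child.last_used = now
            node = child
        self._evict()

    def _leaves(self):
        stack = [self.root]
        while stack:
            n = stack.pop()
            if n is not self.root and not n.children:
                yield n
            stack.extend(n.children.values())

    def _evict(self) -> None:
        # Only leaves are candidates, so shared interior prefixes outlive their branches.
        while self.size > self.capacity:
            victim = min(self._leaves(), key=lambda n: n.last_used)
            del victim.parent.children[victim.token]
            self.size -= 1


def cache_aware_order(cache: ToyPrefixCache, waiting: list[list[int]]) -> list[list[int]]:
    """Order waiting requests so the longest cached prefix is served first."""
    return sorted(waiting, key=cache.match_prefix, reverse=True)


cache = ToyPrefixCache(capacity=6)
system_prompt = [1, 2, 3]                         # token IDs shared by every request
cache.insert(system_prompt + [10, 11])            # first request fills the cache
print(cache.match_prefix(system_prompt + [20]))   # 3: the shared system prompt is reused
```

Restricting eviction to leaves is the design choice that matters here: a shared system prompt sits in interior nodes, so it can only be evicted after every branch hanging off it has gone.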
