SGLang is an open-source, high-performance serving framework for large language models and multimodal models, developed at UC Berkeley and stewarded by LMSYS. It targets a core bottleneck in LLM inference, redundant computation when requests share common prefixes, through RadixAttention: a radix-tree KV cache that automatically identifies and reuses overlapping token sequences across requests. Beyond caching, SGLang is co-designed: a Python-embedded frontend DSL for expressing multi-step programs (fork/join, constrained generation, conditional branching) is tightly coupled to a backend runtime that schedules and batches those programs for the GPU. The key insight is that structured outputs, prefix reuse, and parallelism are not independent features bolted together but parts of one system, and this co-design underlies SGLang's reported throughput gains of up to 6.4× over baseline systems such as vLLM on prefix-heavy and agentic workloads.
What This Cheat Sheet Covers
This topic spans 15 focused tables and 123 indexed concepts. Below is a table-by-table outline, from foundational concepts through advanced details.
Table 1: Core Runtime Architecture
RadixAttention and the zero-overhead scheduler are the two foundational innovations distinguishing SGLang from other LLM runtimes. Together they ensure that neither KV computation nor GPU time is wasted between requests.
| Technique | Example | Description |
|---|---|---|
| RadixAttention | `python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct` (radix cache is on by default; `--disable-radix-cache` turns it off) | • Stores KV cache tensors in a radix tree (a compressed trie) keyed by token sequences • New requests walk the tree to find the longest matching prefix and reuse its KV cache without recomputation |
| LRU eviction | Tree fills GPU memory → least recently used leaf nodes evicted first | • When radix tree nodes fill VRAM, SGLang evicts nodes by recency • Leaf nodes are evicted before shared interior nodes to preserve the most broadly reusable prefixes |
| Cache-aware scheduling | Incoming request matched to cached prefix → dispatched ahead of requests with no prefix match | • Scheduler prioritizes requests with the longest cached prefix match • Reported to reach about 96% of the optimal cache hit rate |
| Zero-overhead scheduler | GPU processes batch N → CPU prepares batch N+1 in parallel | • Dedicated scheduler process overlaps batch preparation with GPU execution, removing GPU idle time between batches • Speedup is most significant on small models and large tensor-parallel (TP) configurations |
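
The first three rows of the table can be illustrated with a toy model. The sketch below is not SGLang's implementation. It assumes a plain token-level trie instead of a compressed radix tree, counts cached tokens instead of storing KV tensors, and uses a logical clock for recency. The names `ToyPrefixCache`, `match_len`, and `cache_aware_order` are hypothetical and exist only for this example.

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional

_clock = itertools.count()  # logical clock standing in for real access times


@dataclass
class Node:
    parent: Optional["Node"] = None
    token: Optional[int] = None
    children: dict = field(default_factory=dict)
    last_used: int = 0


class ToyPrefixCache:
    """Token-level trie; each non-root node stands for one cached token's KV entry."""

    def __init__(self, capacity):
        self.root = Node()
        self.capacity = capacity  # assumed budget: max number of cached tokens
        self.size = 0

    def match_len(self, tokens):
        """Length of the longest cached prefix of `tokens` (read-only lookup)."""
        node, matched = self.root, 0
        for t in tokens:
            node = node.children.get(t)
            if node is None:
                break
            matched += 1
        return matched

    def insert(self, tokens):
        """Cache `tokens`; return how many leading tokens were already cached."""
        now = next(_clock)
        node, reused = self.root, 0
        for t in tokens:
            child = node.children.get(t)
            if child is None:
                child = Node(parent=node, token=t)
                node.children[t] = child
                self.size += 1
            else:
                reused += 1  # only happens along the shared prefix
            child.last_used = now
            node = child
        self._evict()
        return reused

    def _nodes(self):
        stack = list(self.root.children.values())
        while stack:
            node = stack.pop()
            yield node
            stack.extend(node.children.values())

    def _evict(self):
        """Drop least recently used leaves until the cache fits its budget."""
        while self.size > self.capacity:
            leaves = [n for n in self._nodes() if not n.children]
            victim = min(leaves, key=lambda n: n.last_used)
            del victim.parent.children[victim.token]
            self.size -= 1


def cache_aware_order(cache, waiting):
    """Order waiting requests so the longest cached prefix is served first."""
    return sorted(waiting, key=cache.match_len, reverse=True)


system = [1, 2, 3, 4]                      # shared system-prompt tokens
cache = ToyPrefixCache(capacity=8)
cache.insert(system + [10, 11])            # cold start, nothing reused
reused = cache.insert(system + [20, 21])   # shares the 4 system tokens
cache.insert(system + [30, 31])            # over budget: oldest leaves evicted
ordered = cache_aware_order(cache, [[9, 9], system + [20, 21, 5], system + [7]])
```

After the third insert, the two tokens unique to the oldest request are evicted while the shared system-prompt prefix stays cached, which is the behavior the LRU eviction row describes. A real radix tree would additionally merge single-child chains into one edge so that lookups and eviction operate on token spans rather than individual tokens.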