SGLang is an open-source, high-performance serving framework for large language models and multimodal models, developed at UC Berkeley and stewarded by LMSYS. It addresses a core bottleneck in LLM inference, redundant computation when requests share common prefixes, through RadixAttention: a radix-tree KV cache that automatically identifies and reuses overlapping token sequences across concurrent requests. Beyond caching, SGLang is co-designed end to end: a Python-embedded frontend DSL for expressing multi-step programs (fork/join, constrained generation, conditional branching) is tightly coupled to a backend runtime that translates those programs into efficiently batched GPU work. The critical insight is that structured outputs, prefix reuse, and parallelism are not independent features bolted together but a unified system, and that co-design is what enables SGLang to reach up to 6.4× higher throughput than vLLM on prefix-heavy and agentic workloads.
What This Cheat Sheet Covers
This topic spans 15 focused tables and 123 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Runtime Architecture
RadixAttention and the zero-overhead scheduler are the two foundational innovations distinguishing SGLang from other LLM runtimes. Together they ensure that neither KV computation nor GPU time is wasted between requests.
| Technique | Example | Description |
|---|---|---|
| RadixAttention | `python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct` (radix caching is on by default; `--disable-radix-cache` turns it off) | Stores KV cache tensors in a radix tree (a compressed trie) keyed by token sequences; new requests walk the tree to find the longest matching prefix and reuse its KV cache without recomputation. |
| LRU eviction | Tree fills GPU memory → least-recently-used subtrees evicted first | When radix tree nodes fill VRAM, SGLang evicts nodes by recency; leaf nodes are evicted before shared interior nodes to preserve the most broadly reusable prefixes. |
| Cache-aware scheduling | Incoming request matched to cached prefix → dispatched ahead of requests with no prefix match | Scheduler prioritizes requests that produce the highest prefix cache hit, reaching ~96% of optimal hit rate in production. |
| Zero-overhead scheduler | GPU processes batch N while CPU prepares batch N+1 in parallel | Dedicated scheduler process overlaps batch preparation with GPU execution, eliminating idle GPU time; the speedup is most significant on small models and large tensor-parallel (TP) configurations. |
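
The prefix matching, leaf-first LRU eviction, and prefix-aware request ordering described in this table can be illustrated with a toy model. This is a minimal sketch under simplifying assumptions, not SGLang's implementation: `ToyPrefixCache` and `cache_aware_order` are hypothetical names, each node holds a single token rather than a compressed edge, capacity is counted in tokens instead of GPU memory, and no KV tensors are stored.

```python
from __future__ import annotations
from itertools import count


class Node:
    """One cached token; a real radix tree would store a token run plus KV tensors."""

    def __init__(self, parent: Node | None = None, token: int | None = None) -> None:
        self.parent = parent
        self.token = token
        self.children: dict[int, Node] = {}
        self.last_used = 0


class ToyPrefixCache:
    def __init__(self, capacity: int) -> None:
        self.root = Node()
        self.capacity = capacity  # maximum number of cached tokens
        self.size = 0
        self._clock = count()     # logical clock used for recency

    def match_prefix(self, tokens: list[int], touch: bool = True) -> int:
        """Return how many leading tokens are already cached."""
        node, matched, now = self.root, 0, next(self._clock)
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            if touch:
                child.last_used = now  # a reused prefix counts as recently used
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Cache a token sequence, then evict LRU leaves if over capacity."""
        node, now = self.root, next(self._clock)
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                child = Node(parent=node, token=tok)
                node.children[tok] = child
                self.size += 1
            child.last_used = now
            node = child
        while self.size > self.capacity:
            self._evict_lru_leaf()

    def _leaves(self):
        stack = [self.root]
        while stack:
            node = stack.pop()
            if node.children:
                stack.extend(node.children.values())
            elif node is not self.root:
                yield node

    def _evict_lru_leaf(self) -> None:
        # Only leaves are candidates, so shared interior prefixes survive longest.
        victim = min(self._leaves(), key=lambda n: n.last_used)
        del victim.parent.children[victim.token]
        self.size -= 1


def cache_aware_order(cache: ToyPrefixCache, requests: list[list[int]]) -> list[list[int]]:
    """Order requests so the longest cached prefix is served first."""
    return sorted(requests, key=lambda r: cache.match_prefix(r, touch=False), reverse=True)


cache = ToyPrefixCache(capacity=6)
system = [1, 2, 3]                       # stands in for a shared system prompt
cache.insert(system + [10])
cache.insert(system + [20])              # reuses the three shared tokens
hits = cache.match_prefix(system + [20, 99])
ordered = cache_aware_order(cache, [[7, 8], system + [10, 5], [1, 2, 9]])
cache.insert([7, 8])                     # exceeds capacity, so the oldest leaf is evicted
print(hits, ordered, cache.size)
```

In this run the leaf for token `10` is the least recently used, so it is evicted while the shared `[1, 2, 3]` prefix stays cached. A production scheduler would also have to protect nodes referenced by in-flight requests, which this sketch omits.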