vLLM is an open-source, high-throughput inference and serving engine for large language models, originating from UC Berkeley research. It solves the GPU memory fragmentation problem that plagued earlier serving systems by introducing PagedAttention, an attention algorithm inspired by OS virtual memory paging, which reduces KV-cache memory waste from 60-80% down to under 4%. The key mental model is that vLLM separates the what (the model weights) from the how (memory management and scheduling), letting a single server host many concurrent requests with near-optimal GPU utilization. Every optimization (prefix caching, chunked prefill, continuous batching, speculative decoding) ultimately serves this goal of keeping GPU compute saturated while minimizing latency per token.
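The waste reduction described above comes down to simple arithmetic, sketched here in plain Python. This is a back-of-the-envelope illustration, not vLLM code: the function names, the sequence lengths, the 2048-token reservation, and the 16-token block size are all assumptions chosen for the example, and the printed percentages are not measurements.

```python
import math


def contiguous_waste(seq_lens, max_len):
    """Fraction of reserved KV slots left unused when every request
    reserves max_len contiguous slots up front."""
    reserved = max_len * len(seq_lens)
    return 1 - sum(seq_lens) / reserved


def paged_waste(seq_lens, block_size):
    """Fraction of reserved KV slots left unused when slots are handed out
    in fixed-size blocks; only a sequence's last block can be partly empty."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return 1 - sum(seq_lens) / reserved


# Hypothetical final sequence lengths (prompt + generated tokens).
seq_lens = [310, 95, 742, 1280, 57]

print(f"contiguous reservation: {contiguous_waste(seq_lens, max_len=2048):.1%} wasted")
print(f"paged, 16-token blocks: {paged_waste(seq_lens, block_size=16):.1%} wasted")
```

With paging, the waste per sequence is bounded by one block minus one token, regardless of how far the actual length falls short of the maximum.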
What This Cheat Sheet Covers
This cheat sheet spans 20 focused tables and 163 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.
Table 1: Core Architecture Components
The vLLM V1 engine uses a multi-process design that bypasses Python's GIL and separates HTTP handling, scheduling, and GPU execution into dedicated processes. Understanding these components is the foundation for tuning, debugging, and extending vLLM.
| Component | Example | Description |
|---|---|---|
| PagedAttention | `# KV blocks allocated in fixed-size pages` | Core memory management algorithm; stores the KV cache in non-contiguous fixed-size blocks (pages), which removes external fragmentation, confines internal waste to each sequence's last block, and lets requests with shared prefixes share memory. |
| KV cache manager | `# Manages GPU memory pool of KVCacheBlock objects` | Allocates, tracks, and evicts KV cache blocks using a doubly-linked free queue for O(1) operations; uses LRU eviction under memory pressure. |
| Scheduler | `# Enforces max_num_batched_tokens budget` | Maintains a waiting deque and a running list; implements continuous batching by filling freed slots immediately rather than waiting for a full batch to complete. |
| Engine core | `# Runs inference busy-loop in isolated process` | Central inference orchestrator that pulls from an internal input queue and runs engine steps; isolated from the API server process to avoid GIL contention. |
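
The interplay between the block pool, the token budget, and continuous batching in the table above can be sketched as a toy scheduler. This is a simplified illustration under stated assumptions, not vLLM's implementation: `BlockPool`, `Request`, `ToyScheduler`, and `BLOCK_SIZE` are hypothetical names, a `collections.deque` stands in for the doubly-linked free queue, and prefix caching, preemption, and chunked prefill are left out. Only the `max_num_batched_tokens` name is taken from the table.

```python
from collections import deque
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (assumed for this sketch)


def blocks_needed(num_tokens):
    """Number of fixed-size blocks required to hold num_tokens."""
    return -(-num_tokens // BLOCK_SIZE)  # ceiling division


class BlockPool:
    """Toy free queue of block IDs. Allocations pop from the front and freed
    blocks are appended to the back, so the block freed longest ago is
    reused first (a simple LRU-style order)."""

    def __init__(self, num_blocks):
        self.free = deque(range(num_blocks))

    def allocate(self, n):
        if n > len(self.free):
            return None  # not enough free blocks
        return [self.free.popleft() for _ in range(n)]

    def release(self, blocks):
        self.free.extend(blocks)


@dataclass
class Request:
    rid: str
    prompt_len: int
    max_new_tokens: int
    generated: int = 0
    blocks: list = field(default_factory=list)


class ToyScheduler:
    def __init__(self, pool, max_num_batched_tokens):
        self.pool = pool
        self.max_num_batched_tokens = max_num_batched_tokens
        self.waiting = deque()
        self.running = []

    def add_request(self, req):
        self.waiting.append(req)

    def step(self):
        """Schedule one step and return {rid: tokens processed this step}."""
        budget = self.max_num_batched_tokens
        scheduled = {}

        # Running requests go first: each decodes exactly one token.
        for req in self.running:
            if budget < 1:
                break
            extra = blocks_needed(req.prompt_len + req.generated) - len(req.blocks)
            new_blocks = self.pool.allocate(extra)
            if new_blocks is None:
                continue  # a real engine would preempt; this sketch just skips
            req.blocks += new_blocks
            scheduled[req.rid] = 1
            budget -= 1

        # Continuous batching: admit waiting requests into whatever token
        # budget and free blocks remain, without waiting for the batch to drain.
        while self.waiting and self.waiting[0].prompt_len <= budget:
            req = self.waiting[0]
            new_blocks = self.pool.allocate(blocks_needed(req.prompt_len))
            if new_blocks is None:
                break
            self.waiting.popleft()
            req.blocks = new_blocks
            self.running.append(req)
            scheduled[req.rid] = req.prompt_len
            budget -= req.prompt_len

        # Pretend the model ran: each scheduled request gains one token, and
        # finished requests return their blocks to the pool.
        for req in list(self.running):
            if req.rid in scheduled:
                req.generated += 1
                if req.generated >= req.max_new_tokens:
                    self.running.remove(req)
                    self.pool.release(req.blocks)
        return scheduled


# Drive the scheduler in a loop, loosely mirroring an engine busy-loop.
pool = BlockPool(num_blocks=8)
sched = ToyScheduler(pool, max_num_batched_tokens=64)
for rid, prompt_len, max_new in [("A", 40, 2), ("B", 30, 3), ("C", 20, 1)]:
    sched.add_request(Request(rid, prompt_len, max_new))
while sched.waiting or sched.running:
    print(sched.step())
```

In this run, request B does not fit in the first step's budget next to A's 40-token prefill, so it joins on the second step alongside A's decode instead of waiting for A to finish. That is the behavior the Scheduler row describes.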