vLLM (LLM Inference Engine) Cheat Sheet

Updated 2026-05-21


vLLM is an open-source, high-throughput inference and serving engine for large language models, originating from UC Berkeley research. It addresses the GPU memory fragmentation that plagued earlier serving systems by introducing PagedAttention, an attention algorithm inspired by OS virtual memory paging, which reduces KV-cache memory waste from 60–80% down to under 4%. The key mental model is that vLLM separates the what (the model weights) from the how (memory management and scheduling), so a single server can host many concurrent requests with near-optimal GPU utilization. Every optimization (prefix caching, chunked prefill, continuous batching, speculative decoding) ultimately serves the same goal: keeping GPU compute saturated while minimizing latency per token.
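To make the memory-waste claim concrete, here is a minimal back-of-the-envelope sketch. It is not vLLM code: the function names, the request lengths, the 1024-token reservation, and the 16-token block size are all assumptions chosen for illustration. It compares reserving a full contiguous region per request against allocating fixed-size blocks on demand.

```python
import math

def contiguous_waste(seq_lens, max_len):
    """Fraction of reserved KV slots left unused when every request
    reserves a contiguous region of max_len slots up front."""
    reserved = len(seq_lens) * max_len
    return 1 - sum(seq_lens) / reserved

def paged_waste(seq_lens, block_size):
    """Fraction of reserved KV slots left unused when every request
    holds only ceil(len / block_size) fixed-size blocks."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return 1 - sum(seq_lens) / reserved

if __name__ == "__main__":
    # Hypothetical token counts for five in-flight requests.
    seq_lens = [37, 512, 130, 900, 64]
    print(f"contiguous: {contiguous_waste(seq_lens, max_len=1024):.1%}")
    print(f"paged:      {paged_waste(seq_lens, block_size=16):.1%}")
```

With paging, the only unused slots are in the last, partially filled block of each request, so waste is bounded by one block per request regardless of how long the request could have grown.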

What This Cheat Sheet Covers

This topic spans 20 focused tables and 163 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

Table 1: Core Architecture Components
Table 2: Installation and Quickstart
Table 3: SamplingParams and Generation Control
Table 4: Key Engine Arguments (Server Flags)
Table 5: Automatic Prefix Caching (APC)
Table 6: Continuous Batching and Scheduling
Table 7: Parallelism Strategies
Table 8: Quantization Methods
Table 9: LoRA Adapters
Table 10: Speculative Decoding
Table 11: Structured Output (Guided Decoding)
Table 12: Reasoning Model Support
Table 13: Tool Calling and Function Calling
Table 14: Multimodal Models (VLMs)
Table 15: API Endpoints and Protocols
Table 16: CUDA Graphs and Compilation
Table 17: Prometheus Metrics and Observability
Table 18: Benchmarking Tools
Table 19: Environment Variables
Table 20: V1 Engine Architecture Improvements

Table 1: Core Architecture Components

The vLLM V1 engine uses a multi-process design that bypasses Python's GIL and separates HTTP handling, scheduling, and GPU execution into dedicated processes. Understanding these components is the foundation for tuning, debugging, and extending vLLM.

Component	Example	Description
PagedAttention	`# KV blocks allocated in fixed-size pages`	• Core memory management algorithm • stores KV-cache in non-contiguous fixed-size blocks (pages), eliminating fragmentation and enabling memory sharing between requests with shared prefixes
KVCacheManager	`# Manages GPU memory pool of KVCacheBlock objects`	• Allocates, tracks, and evicts KV cache blocks using a doubly-linked free queue for O(1) operations • uses LRU eviction under memory pressure
Scheduler	`# Enforces max_num_batched_tokens budget`	• Maintains a waiting deque and running list • implements continuous batching by immediately filling freed slots rather than waiting for a full batch to complete
EngineCore	`# Runs inference busy-loop in isolated process`	• Central inference orchestrator that pulls from an internal input queue and runs engine steps • isolated from the API server process to avoid GIL contention
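
The interplay between the block pool and the scheduler in the table above can be sketched as a toy simulation. This is an illustrative model only, not vLLM's implementation: `Request`, `BlockPool`, `ToyScheduler`, and `BLOCK_SIZE` are hypothetical names, a `collections.deque` stands in for the doubly-linked free queue, and preemption, eviction, and prefix sharing are left out.

```python
from collections import deque
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # assumed number of tokens per KV block

@dataclass
class Request:
    req_id: str
    prompt_len: int
    max_new_tokens: int
    computed: int = 0    # tokens whose KV entries are already cached
    generated: int = 0   # tokens sampled so far
    blocks: list = field(default_factory=list)

    @property
    def total(self):
        """Tokens that exist so far (prompt plus generated)."""
        return self.prompt_len + self.generated

class BlockPool:
    """Free queue of block ids; deque gives O(1) pops and appends."""
    def __init__(self, num_blocks):
        self.free = deque(range(num_blocks))

    def allocate(self, req, num_tokens):
        """Grow req.blocks to cover num_tokens; False if the pool is short."""
        needed = -(-num_tokens // BLOCK_SIZE) - len(req.blocks)
        if needed > len(self.free):
            return False
        for _ in range(max(needed, 0)):
            req.blocks.append(self.free.popleft())
        return True

    def release(self, req):
        # Freed blocks go to the back, so the oldest freed block is reused first.
        self.free.extend(req.blocks)
        req.blocks.clear()

class ToyScheduler:
    def __init__(self, pool, max_num_batched_tokens):
        self.pool = pool
        self.budget = max_num_batched_tokens
        self.waiting = deque()
        self.running = []

    def add(self, req):
        self.waiting.append(req)

    def step(self):
        """Run one engine step; return {req_id: tokens scheduled}."""
        budget, scheduled = self.budget, {}
        # 1) Running requests first: one decode token or remaining prefill.
        for req in self.running:
            n = min(req.total - req.computed, budget)
            if n > 0 and self.pool.allocate(req, req.computed + n):
                scheduled[req.req_id] = n
                budget -= n
        # 2) Fill leftover budget from the waiting queue right away.
        while self.waiting and budget > 0:
            req = self.waiting[0]
            n = min(req.total, budget)  # prefill is chunked to fit the budget
            if not self.pool.allocate(req, n):
                break  # a real scheduler might preempt; this sketch just waits
            self.running.append(self.waiting.popleft())
            scheduled[req.req_id] = n
            budget -= n
        # 3) Pretend the model ran: advance state and retire finished requests.
        for req in list(self.running):
            n = scheduled.get(req.req_id, 0)
            req.computed += n
            if n and req.computed == req.total:
                req.generated += 1  # one new token sampled
                if req.generated == req.max_new_tokens:
                    self.pool.release(req)
                    self.running.remove(req)
        return scheduled

if __name__ == "__main__":
    sched = ToyScheduler(BlockPool(8), max_num_batched_tokens=32)
    sched.add(Request("a", prompt_len=40, max_new_tokens=2))
    sched.add(Request("b", prompt_len=10, max_new_tokens=3))
    while sched.waiting or sched.running:
        print(sched.step())
```

Request "b" is admitted in the second step, while "a" is still finishing its prefill, which is the essence of continuous batching: the batch is recomposed at every step instead of waiting for all members to finish.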
