On-device LLM inference brings large language model capabilities directly to local hardware—laptops, desktops, mobile devices, and edge systems—without requiring cloud connectivity or API calls. This approach delivers zero-latency response, true privacy (data never leaves the device), and cost elimination (no per-token charges), making it essential for privacy-sensitive applications, offline environments, and cost-conscious deployments. The 2026 landscape centers on three core pillars: quantization (reducing model precision to fit consumer hardware), specialized frameworks (Ollama, llama.cpp, MLX), and hardware acceleration (dedicated NPUs, Apple Silicon unified memory, GPU offloading). Understanding quantization formats like GGUF, choosing the right inference runtime, and matching model size to available VRAM determines whether local inference runs at 5 tokens/second or 50.
What This Cheat Sheet Covers
This topic spans 12 focused tables and 101 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Local Inference Frameworks and Runtimes
Popular runtimes for deploying and managing LLMs on local hardware vary in complexity, GPU support, and ecosystem integration.
| Framework | Example | Description |
|---|---|---|
ollama run llama3.1 | CLI-first runtime with one-command model management, OpenAI-compatible API, and automatic GPU detection; dominates macOS/Linux local inference in 2026 with 52M+ monthly downloads. | |
Download model via GUI → Load → Chat | Polished desktop GUI for model discovery, download, and inference; supports GPU offloading slider, per-model default settings, and embedded local server; best newcomer experience. | |
./llama-server -m model.gguf --port 8080 | Low-level C++ inference engine with CPU/GPU/Metal acceleration; foundation of Ollama/LM Studio; offers maximum control and performance tuning for advanced users. | |
mlx_lm.generate(model, prompt) --max-tokens 100 | Native framework for Apple Silicon leveraging unified memory architecture; achieves 143 tok/s on M5 Max with Qwen3-VL-4B; NumPy-like API for researchers. | |
<script src="webllm.js">await engine.reload("Phi-3") | Browser-based inference via WebGPU with hardware acceleration; enables client-side AI without backend; 7B models run at 15-25 tok/s in Chrome/Edge. |