On-Device LLM Inference Cheat Sheet

Updated 2026-05-18

Next Topic: Online Learning and Concept Drift Adaptation Cheat Sheet

On-device LLM inference brings large language model capabilities directly to local hardware—laptops, desktops, mobile devices, and edge systems—without requiring cloud connectivity or API calls. This approach delivers zero-latency response, true privacy (data never leaves the device), and cost elimination (no per-token charges), making it essential for privacy-sensitive applications, offline environments, and cost-conscious deployments. The 2026 landscape centers on three core pillars: quantization (reducing model precision to fit consumer hardware), specialized frameworks (Ollama, llama.cpp, MLX), and hardware acceleration (dedicated NPUs, Apple Silicon unified memory, GPU offloading). Understanding quantization formats like GGUF, choosing the right inference runtime, and matching model size to available VRAM determines whether local inference runs at 5 tokens/second or 50.

What This Cheat Sheet Covers

This topic spans 12 focused tables and 101 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Local Inference Frameworks and RuntimesTable 2: Quantization Formats and Precision LevelsTable 3: Ollama Command-Line OperationsTable 4: LM Studio GUI Workflows and ConfigurationTable 5: llama.cpp Quantization and ConversionTable 6: Hardware Requirements and VRAM AllocationTable 7: Privacy and Security BenefitsTable 8: Apple MLX Framework AccelerationTable 9: Mobile and Edge DeploymentTable 10: Browser-Based Inference with WebLLM and Transformers.jsTable 11: Deployment Patterns and Production ArchitecturesTable 12: Optimization Techniques and Performance Tuning

Table 1: Local Inference Frameworks and Runtimes

Popular runtimes for deploying and managing LLMs on local hardware vary in complexity, GPU support, and ecosystem integration.

Framework	Example	Description
Ollama	`ollama run llama3.1`	• CLI-first runtime with one-command model management, OpenAI-compatible API, and automatic GPU detection • dominates macOS/Linux local inference in 2026 with 52M+ monthly downloads
LM Studio	Download model via GUI → Load → Chat	• Polished desktop GUI for model discovery, download, and inference • supports GPU offloading slider, per-model default settings, and embedded local server • best newcomer experience
llama.cpp	`./llama-server -m model.gguf` `--port 8080`	• Low-level C++ inference engine with CPU/GPU/Metal acceleration • foundation of Ollama/LM Studio • offers maximum control and performance tuning for advanced users
Apple MLX	`mlx_lm.generate(model, prompt)` `--max-tokens 100`	• Native framework for Apple Silicon leveraging unified memory architecture • achieves 143 tok/s on M5 Max with Qwen3-VL-4B • NumPy-like API for researchers
WebLLM	`<script src="webllm.js">` `await engine.reload("Phi-3")`	• Browser-based inference via WebGPU with hardware acceleration • enables client-side AI without backend • 7B models run at 15-25 tok/s in Chrome/Edge

Table 1: Local Inference Frameworks and Runtimes

Popular runtimes for deploying and managing LLMs on local hardware vary in complexity, GPU support, and ecosystem integration.

Framework	Example	Description
Ollama	`ollama run llama3.1`	• CLI-first runtime with one-command model management, OpenAI-compatible API, and automatic GPU detection • dominates macOS/Linux local inference in 2026 with 52M+ monthly downloads
LM Studio	Download model via GUI → Load → Chat	• Polished desktop GUI for model discovery, download, and inference • supports GPU offloading slider, per-model default settings, and embedded local server • best newcomer experience
llama.cpp	`./llama-server -m model.gguf` `--port 8080`	• Low-level C++ inference engine with CPU/GPU/Metal acceleration • foundation of Ollama/LM Studio • offers maximum control and performance tuning for advanced users
Apple MLX	`mlx_lm.generate(model, prompt)` `--max-tokens 100`	• Native framework for Apple Silicon leveraging unified memory architecture • achieves 143 tok/s on M5 Max with Qwen3-VL-4B • NumPy-like API for researchers
WebLLM	`<script src="webllm.js">` `await engine.reload("Phi-3")`	• Browser-based inference via WebGPU with hardware acceleration • enables client-side AI without backend • 7B models run at 15-25 tok/s in Chrome/Edge