Ollama (Local LLM Runtime) Cheat Sheet

Updated 2026-05-21

Ollama is an open-source runtime that packages large language models with their configuration and serves them via a local REST API, letting developers run models like Llama, Mistral, Gemma, and Qwen entirely on their own hardware. It solves the privacy, latency, and cost problems of cloud-hosted LLMs by providing a simple CLI, a Docker-friendly server process, and an OpenAI-compatible API that integrates with existing tooling without code changes. The key mental model: Ollama is a model manager and inference server in one — ollama pull downloads, ollama serve exposes port 11434, and every tool that speaks OpenAI's REST dialect can point at http://localhost:11434/v1 and work immediately.

What This Cheat Sheet Covers

This topic spans 16 focused tables and 146 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core CLI CommandsTable 2: Modelfile InstructionsTable 3: PARAMETER Values (Modelfile & API Options)Table 4: REST API Endpoints (Native)Table 5: OpenAI-Compatible API Endpoints (/v1/*)Table 6: GGUF Quantization LevelsTable 7: GPU Acceleration (CUDA / Metal / ROCm / Vulkan)Table 8: Environment VariablesTable 9: Model Library — Key FamiliesTable 10: Multimodal / Vision CapabilitiesTable 11: Thinking / Reasoning ModeTable 12: Python SDK UsageTable 13: Structured Outputs & Tool CallingTable 14: Embeddings & RAG IntegrationTable 15: Importing Custom Models (GGUF / Safetensors)Table 16: Key Integrations (Open WebUI, Continue, LangChain, LlamaIndex)

Table 1: Core CLI Commands

Every workflow starts at the command line. These commands cover the full lifecycle of a model — downloading, running, inspecting, and removing it — and are the first things to learn before touching the API or Modelfiles.

Command	Example	Description
ollama pull	`ollama pull llama3.2` `ollama pull llama3.2:3b`	Downloads a model (and specific tag/size) from the Ollama registry into local storage.
ollama run	`ollama run llama3.2` `ollama run gemma3 "Why is the sky blue?"`	Pulls (if needed) then launches an interactive chat session, or runs a one-shot prompt when text is supplied as argument.
ollama list (ollama ls)	`ollama list`	Shows all locally downloaded models with NAME, ID, SIZE, and MODIFIED columns.
ollama show	`ollama show llama3.2`	Displays model metadata: architecture, parameters, template, system prompt, and license.
ollama ps	`ollama ps`	Lists currently loaded models with their VRAM/RAM footprint — useful to diagnose memory pressure.
ollama stop	`ollama stop llama3.2`	Immediately unloads a running model from memory without waiting for the keep-alive timer.
ollama rm	`ollama rm llama3.2`	Permanently removes a model from local storage.

Table 1: Core CLI Commands

Command	Example	Description
ollama pull	`ollama pull llama3.2` `ollama pull llama3.2:3b`	Downloads a model (and specific tag/size) from the Ollama registry into local storage.
ollama run	`ollama run llama3.2` `ollama run gemma3 "Why is the sky blue?"`	Pulls (if needed) then launches an interactive chat session, or runs a one-shot prompt when text is supplied as argument.
ollama list (ollama ls)	`ollama list`	Shows all locally downloaded models with NAME, ID, SIZE, and MODIFIED columns.
ollama show	`ollama show llama3.2`	Displays model metadata: architecture, parameters, template, system prompt, and license.
ollama ps	`ollama ps`	Lists currently loaded models with their VRAM/RAM footprint — useful to diagnose memory pressure.
ollama stop	`ollama stop llama3.2`	Immediately unloads a running model from memory without waiting for the keep-alive timer.
ollama rm	`ollama rm llama3.2`	Permanently removes a model from local storage.