Ollama is an open-source runtime that packages large language models with their configuration and serves them via a local REST API, letting developers run models like Llama, Mistral, Gemma, and Qwen entirely on their own hardware. It solves the privacy, latency, and cost problems of cloud-hosted LLMs by providing a simple CLI, a Docker-friendly server process, and an OpenAI-compatible API that integrates with existing tooling without code changes. The key mental model: Ollama is a model manager and inference server in one β ollama pull downloads, ollama serve exposes port 11434, and every tool that speaks OpenAI's REST dialect can point at http://localhost:11434/v1 and work immediately.
What This Cheat Sheet Covers
This topic spans 16 focused tables and 146 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core CLI Commands
Every workflow starts at the command line. These commands cover the full lifecycle of a model β downloading, running, inspecting, and removing it β and are the first things to learn before touching the API or Modelfiles.
| Command | Example | Description |
|---|---|---|
ollama pull llama3.2ollama pull llama3.2:3b | Downloads a model (and specific tag/size) from the Ollama registry into local storage. | |
ollama run llama3.2ollama run gemma3 "Why is the sky blue?" | Pulls (if needed) then launches an interactive chat session, or runs a one-shot prompt when text is supplied as argument. | |
ollama list | Shows all locally downloaded models with NAME, ID, SIZE, and MODIFIED columns. | |
ollama show llama3.2 | Displays model metadata: architecture, parameters, template, system prompt, and license. | |
ollama ps | Lists currently loaded models with their VRAM/RAM footprint β useful to diagnose memory pressure. | |
ollama stop llama3.2 | Immediately unloads a running model from memory without waiting for the keep-alive timer. | |
ollama rm llama3.2 | Permanently removes a model from local storage. |