Skip to main content

Menu

LEVEL 0
0/5 XP
HomeAboutTopicsPricingMy VaultStats

Categories

πŸ€– Artificial Intelligence
☁️ Cloud and Infrastructure
πŸ’Ύ Data and Databases
πŸ’Ό Professional Skills
🎯 Programming and Development
πŸ”’ Security and Networking
πŸ“š Specialized Topics
HomeAboutTopicsPricingMy VaultStats
LEVEL 0
0/5 XP
GitHub
Β© 2026 CheatGridβ„’. All rights reserved.
Privacy PolicyTerms of UseAboutContact

Ollama (Local LLM Runtime) Cheat Sheet

Ollama (Local LLM Runtime) Cheat Sheet

Back to Generative AI
Updated 2026-05-21
Next Topic: OpenAI API Cheat Sheet

Ollama is an open-source runtime that packages large language models with their configuration and serves them via a local REST API, letting developers run models like Llama, Mistral, Gemma, and Qwen entirely on their own hardware. It solves the privacy, latency, and cost problems of cloud-hosted LLMs by providing a simple CLI, a Docker-friendly server process, and an OpenAI-compatible API that integrates with existing tooling without code changes. The key mental model: Ollama is a model manager and inference server in one β€” ollama pull downloads, ollama serve exposes port 11434, and every tool that speaks OpenAI's REST dialect can point at http://localhost:11434/v1 and work immediately.

What This Cheat Sheet Covers

This topic spans 16 focused tables and 146 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core CLI CommandsTable 2: Modelfile InstructionsTable 3: PARAMETER Values (Modelfile & API Options)Table 4: REST API Endpoints (Native)Table 5: OpenAI-Compatible API Endpoints (/v1/*)Table 6: GGUF Quantization LevelsTable 7: GPU Acceleration (CUDA / Metal / ROCm / Vulkan)Table 8: Environment VariablesTable 9: Model Library β€” Key FamiliesTable 10: Multimodal / Vision CapabilitiesTable 11: Thinking / Reasoning ModeTable 12: Python SDK UsageTable 13: Structured Outputs & Tool CallingTable 14: Embeddings & RAG IntegrationTable 15: Importing Custom Models (GGUF / Safetensors)Table 16: Key Integrations (Open WebUI, Continue, LangChain, LlamaIndex)

Table 1: Core CLI Commands

Every workflow starts at the command line. These commands cover the full lifecycle of a model β€” downloading, running, inspecting, and removing it β€” and are the first things to learn before touching the API or Modelfiles.

CommandExampleDescription
ollama pull
ollama pull llama3.2
ollama pull llama3.2:3b
Downloads a model (and specific tag/size) from the Ollama registry into local storage.
ollama run
ollama run llama3.2
ollama run gemma3 "Why is the sky blue?"
Pulls (if needed) then launches an interactive chat session, or runs a one-shot prompt when text is supplied as argument.
ollama list (ollama ls)
ollama list
Shows all locally downloaded models with NAME, ID, SIZE, and MODIFIED columns.
ollama show
ollama show llama3.2
Displays model metadata: architecture, parameters, template, system prompt, and license.
ollama ps
ollama ps
Lists currently loaded models with their VRAM/RAM footprint β€” useful to diagnose memory pressure.
ollama stop
ollama stop llama3.2
Immediately unloads a running model from memory without waiting for the keep-alive timer.
ollama rm
ollama rm llama3.2
Permanently removes a model from local storage.

More in Generative AI

  • NL-to-SQL and Text-to-Code Generation Cheat Sheet
  • OpenAI API Cheat Sheet
  • Advanced RAG Patterns and Optimization Cheat Sheet
  • ColBERT and Late Interaction Retrieval Cheat Sheet
  • LangSmith Cheat Sheet
  • pgvector for Postgres Vector Search Cheat Sheet
View all 95 topics in Generative AI