Token management is the discipline of controlling, optimizing, and tracking how Large Language Models (LLMs) consume input and output tokens, the fundamental units that text is broken into for processing. Because tokens drive both API costs (often $1–$30 per million output tokens in 2026) and context window limits (up to 2M tokens), effective token management determines whether AI applications run efficiently or exhaust their budgets. Key insight: output tokens cost 3–5× more than input tokens; reasoning models generate thousands of hidden thinking tokens billed at output rates; agentic workflows burn 10–100× more tokens than simple API calls; and even tool schema definitions can silently consume 55K–134K tokens before any work begins. Optimization is therefore a production necessity at every layer of your stack.
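
To make the cost asymmetry concrete, here is a minimal sketch of a per-request cost estimate. The `Pricing` class, the `estimate_cost` function, and the prices ($3 and $15 per million tokens) are illustrative assumptions, not any provider's real API or rates; the only rule taken from the text is that hidden thinking tokens are billed at the output rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (illustrative, not real rates)."""
    input_per_million: float
    output_per_million: float


def estimate_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                  thinking_tokens: int = 0) -> float:
    """Return the estimated USD cost of one request.

    Thinking tokens are added to the billable output because they are
    billed at output rates.
    """
    billable_output = output_tokens + thinking_tokens
    input_cost = input_tokens * pricing.input_per_million / 1_000_000
    output_cost = billable_output * pricing.output_per_million / 1_000_000
    return input_cost + output_cost


# Assumed prices: $3 per million input tokens, $15 per million output tokens (5x).
pricing = Pricing(input_per_million=3.00, output_per_million=15.00)

plain = estimate_cost(pricing, input_tokens=3_700, output_tokens=800)
with_thinking = estimate_cost(pricing, input_tokens=3_700, output_tokens=800,
                              thinking_tokens=8_000)

print(f"without thinking: ${plain:.4f}")          # $0.0231
print(f"with thinking:    ${with_thinking:.4f}")  # $0.1431
```

Under these assumed prices, 8,000 hidden thinking tokens make the same request roughly six times more expensive, even though the visible response is unchanged.
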
What This Cheat Sheet Covers
This cheat sheet spans 17 focused tables and 117 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.
Table 1: Context Window Fundamentals
| Concept | Example | Description |
|---|---|---|
| Context window | GPT-5.4: 128K tokens; Claude Opus 4.7: 1M tokens; Gemini 3.1 Pro: 2M tokens | The maximum total tokens (input + output) a model can process in a single request. Exceeding it causes truncation or errors. |
| Input tokens | User prompt: 500 tokens; System instructions: 200 tokens; Retrieved documents: 3000 tokens | Tokens sent to the model (prompts, context, examples). Typically cost $0.50–$5 per million tokens depending on model and tier. |
| Output tokens | Model response: 800 tokens | Tokens generated by the model. Cost 3–5× more than input ($1.50–$30 per million tokens). |
| Thinking tokens | Claude Opus 4.6 thinking: 8000–32000 hidden tokens; o3 reasoning: varies widely per call | Hidden tokens generated by reasoning models (Claude, o3, Gemini, DeepSeek R1) during internal deliberation. Billed at output rates; the same prompt can produce vastly different thinking token counts across calls. |
| Cached tokens | Reused system prompt: 1200 tokens; Cost: $0.30/1M (vs $3.00 uncached) | Tokens from a reusable prompt prefix stored in the KV cache. Cost 90% less than fresh input tokens and reduce time-to-first-token (TTFT) latency. |
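
Two checks follow directly from the rows above: whether a request fits the context window, and how much a cached prefix saves on the input side. The sketch below is a minimal illustration. The function names `fits_context` and `input_cost` are hypothetical helpers rather than part of any SDK, the 500-token fresh suffix is an assumed value, and the 0.30 versus 3.00 per-million rates are taken from the table's example and assumed to be USD.

```python
def fits_context(window: int, input_tokens: int, max_output_tokens: int) -> bool:
    """True if the input plus the reserved output budget fits in the window."""
    return input_tokens + max_output_tokens <= window


def input_cost(prefix_tokens: int, fresh_tokens: int, fresh_price: float,
               cached_price: float, cache_hit: bool) -> float:
    """USD cost of the input side of one request; prices are per million tokens."""
    # On a cache hit only the reusable prefix gets the discounted rate.
    prefix_price = cached_price if cache_hit else fresh_price
    return (prefix_tokens * prefix_price + fresh_tokens * fresh_price) / 1_000_000


# Token counts from the table's examples: 500 + 200 + 3000 input, 800 output.
request_input = 500 + 200 + 3_000
ok = fits_context(window=128_000, input_tokens=request_input, max_output_tokens=800)
print(ok)  # True

# A 1,200-token reused prefix plus an assumed 500 fresh tokens per request.
miss = input_cost(1_200, 500, fresh_price=3.00, cached_price=0.30, cache_hit=False)
hit = input_cost(1_200, 500, fresh_price=3.00, cached_price=0.30, cache_hit=True)
print(f"cache miss: ${miss:.5f}  cache hit: ${hit:.5f}")
```

Note that the 90% discount applies only to the cached prefix, so the overall saving per request depends on how much of the input is reusable.
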