Token Management Cheat Sheet

Updated 2026-04-28


Token management is the discipline of controlling, optimizing, and tracking how Large Language Models (LLMs) consume input and output tokens, the fundamental units that text is broken into for processing. Tokens drive both API costs (often $1–$30 per million output tokens in 2026) and context window limits (up to 2M tokens), so effective token management determines whether AI applications run efficiently or exhaust their budgets. Key insights: output tokens cost 3–5× more than input tokens; reasoning models generate thousands of hidden thinking tokens billed at output rates; agentic workflows burn 10–100× more tokens than simple API calls; and even tool schema definitions can silently consume 55K–134K tokens before any work begins. Together these make optimization a production necessity at every layer of your stack.

What This Cheat Sheet Covers

This topic spans 17 focused tables and 117 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Context Window Fundamentals
Table 2: Tokenization Algorithms
Table 3: Token Counting and Estimation
Table 4: Rate Limits and Usage Tiers
Table 5: Cost Optimization Techniques
Table 6: Context Management Strategies
Table 7: Chunking Strategies for RAG
Table 8: Generation Parameters
Table 9: Thinking Budget Controls
Table 10: Inference Acceleration Techniques
Table 11: Attention Mechanisms for Long Context
Table 12: Caching and Reuse
Table 13: Cost Tracking and Observability
Table 14: API Pricing Models (2026 Snapshot)
Table 15: Advanced Optimization Patterns
Table 16: Streaming and Real-Time Considerations
Table 17: Context Extension Techniques

Table 1: Context Window Fundamentals

Before you can optimize token usage you have to know what you're actually paying for, and these are the building blocks—input, output, reasoning, and cached tokens—plus the window that caps them all. The asymmetry is the thing to internalize: output costs several times more than input, hidden reasoning tokens bill at output rates, and stuffing the window too full quietly degrades quality long before you hit a hard error.

Concept	Example	Description
Context Window	GPT-5.4: 128K tokens; Claude Opus 4.7: 1M tokens; Gemini 3.1 Pro: 2M tokens	• The maximum total tokens (input + output) a model can process in a single request • Exceeding it causes truncation or errors.
Input Token	User prompt: 500 tokens; System instructions: 200 tokens; Retrieved documents: 3000 tokens	• Tokens sent to the model (prompts, context, examples) • Typically cost $0.50–$5 per million tokens depending on model and tier.
Output Token	Model response: 800 tokens	• Tokens generated by the model • Cost 3–5× more than input ($1.50–$30 per million tokens).
Reasoning Token	Claude Opus 4.6 thinking: 8000–32000 hidden tokens; o3 reasoning: varies widely per call	• Hidden tokens generated by reasoning models (Claude, o3, Gemini, DeepSeek R1) during internal deliberation • Billed at output rates • The same prompt can produce vastly different thinking-token counts across calls.
Cached Input Token	Reused system prompt: 1200 tokens; Cost: $0.30/1M (vs $3.00 uncached)	• Tokens from a reusable prompt prefix stored in KV cache • Cost 90% less than fresh input tokens and reduce TTFT latency.
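
To make the asymmetry concrete, the rows above can be combined into a rough per-request estimate. This is a minimal sketch, not any provider's billing logic: the `Pricing` and `Usage` classes and both functions are hypothetical names, the $3.00 and $0.30 input rates are the table's example figures, and the $15.00 output rate is an assumed value (5× input). It also assumes reasoning tokens count toward the context window like other output, which may differ by provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical USD prices per million tokens (illustrative, not a real price list)."""
    input_per_m: float = 3.00         # fresh input, from the table's example
    cached_input_per_m: float = 0.30  # cached input, from the table's example
    output_per_m: float = 15.00       # assumed: 5x the input rate


@dataclass(frozen=True)
class Usage:
    """Token counts for a single request."""
    fresh_input: int     # uncached prompt, context, and examples
    cached_input: int    # reused prompt prefix served from cache
    visible_output: int  # tokens in the response the caller sees
    reasoning: int = 0   # hidden thinking tokens, billed at the output rate


def estimate_cost(usage: Usage, pricing: Pricing) -> float:
    """Return the estimated request cost in USD."""
    billed_output = usage.visible_output + usage.reasoning
    micro_dollars = (
        usage.fresh_input * pricing.input_per_m
        + usage.cached_input * pricing.cached_input_per_m
        + billed_output * pricing.output_per_m
    )
    return micro_dollars / 1_000_000


def fits_context_window(usage: Usage, window_tokens: int) -> bool:
    """Check that input plus output (including reasoning) stays within the window."""
    total = (
        usage.fresh_input
        + usage.cached_input
        + usage.visible_output
        + usage.reasoning
    )
    return total <= window_tokens


# Example counts borrowed from the table: a 500-token prompt plus 3000 tokens
# of retrieved documents, a 1200-token cached system prompt, an 800-token
# response, and 8000 hidden reasoning tokens.
usage = Usage(fresh_input=3500, cached_input=1200, visible_output=800, reasoning=8000)
cost = estimate_cost(usage, Pricing())
print(f"estimated cost: ${cost:.5f}")
print("fits in 128K window:", fits_context_window(usage, 128_000))
```

Under these assumed prices, the 8000 reasoning tokens account for about $0.12 of the roughly $0.143 total, which is why hidden thinking tokens tend to dominate the bill for reasoning models.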
