LLM Reasoning and Test-Time Compute Scaling Cheat Sheet

Updated 2026-05-18

Next Topic: LLM Security & Safety Cheat Sheet

Modern reasoning-optimized LLMs allocate additional inference compute to deliberate on complex problems before generating answers, a paradigm known as test-time compute scaling. These models—pioneered by OpenAI's o1/o3 series and DeepSeek-R1—generate extended internal thinking traces during inference, enabling them to solve graduate-level math (AIME), scientific reasoning (GPQA Diamond), and code generation (Codeforces) tasks that stump standard LLMs. Unlike traditional pretraining scaling (more parameters, more data), test-time scaling improves accuracy by investing compute at inference time, turning reasoning into a search problem over possible solution paths. The key insight: spending more time thinking often outperforms adding more training data, opening a new axis of performance improvement orthogonal to model size.

What This Cheat Sheet Covers

This topic spans 14 focused tables and 90 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core Reasoning ArchitecturesTable 2: Training AlgorithmsTable 3: Test-Time Compute StrategiesTable 4: Reasoning Trace GenerationTable 5: Evaluation BenchmarksTable 6: Scaling Laws and Compute AllocationTable 7: Decoding and Sampling ParametersTable 8: Prompting Techniques for Reasoning ModelsTable 9: Reward Modeling and VerificationTable 10: Implementation PatternsTable 11: Failure Modes and LimitationsTable 12: Architecture and Training ComponentsTable 13: Benchmarking and Evaluation PracticesTable 14: Emerging Research Directions

Table 1: Core Reasoning Architectures

Reasoning models learn to generate step-by-step solution paths during training via reinforcement learning signals, then spend variable inference compute exploring and refining those paths at test time. OpenAI's reasoning models hide intermediate thinking tokens from users, while DeepSeek-R1 exposes full reasoning traces, offering interpretability at the cost of privacy.

Model	Example	Description
OpenAI o1	`AIME 2024: 74.3%` `GPQA Diamond: 78.3%`	• First production reasoning model from OpenAI • generates hidden chain-of-thought tokens during inference • trained with RL on verifiable tasks • significantly outperforms GPT-4o on reasoning benchmarks • thinking process not shown to users
OpenAI o3	`ARC-AGI: 87.5% (high)` `AIME 2024: 96.7% (medium)`	• Most capable reasoning model as of May 2026 • introduces reasoning_effort parameter (`low`, `medium`, `high`) to control compute allocation • scored 135 on Mensa IQ test • sets new SOTA across coding, math, science, and visual perception
OpenAI o4-mini	`Codeforces ELO: 2719` `SWE-Bench Verified: 68.1%`	• Cost-efficient reasoning model • slightly outperforms o3 on code tasks • optimized for high throughput with lower latency • adaptive thinking budget balances speed and accuracy • ideal for production deployments
DeepSeek-R1	`AIME 2024: 79.8%` `Codeforces: 92nd percentile`	• Open-weights reasoning model trained via RLVR + GRPO • generates visible reasoning traces in natural language • comparable to o1 on benchmarks • full technical report and weights publicly available

Table 1: Core Reasoning Architectures

Model	Example	Description
OpenAI o1	`AIME 2024: 74.3%` `GPQA Diamond: 78.3%`	• First production reasoning model from OpenAI • generates hidden chain-of-thought tokens during inference • trained with RL on verifiable tasks • significantly outperforms GPT-4o on reasoning benchmarks • thinking process not shown to users
OpenAI o3	`ARC-AGI: 87.5% (high)` `AIME 2024: 96.7% (medium)`	• Most capable reasoning model as of May 2026 • introduces reasoning_effort parameter (`low`, `medium`, `high`) to control compute allocation • scored 135 on Mensa IQ test • sets new SOTA across coding, math, science, and visual perception
OpenAI o4-mini	`Codeforces ELO: 2719` `SWE-Bench Verified: 68.1%`	• Cost-efficient reasoning model • slightly outperforms o3 on code tasks • optimized for high throughput with lower latency • adaptive thinking budget balances speed and accuracy • ideal for production deployments
DeepSeek-R1	`AIME 2024: 79.8%` `Codeforces: 92nd percentile`	• Open-weights reasoning model trained via RLVR + GRPO • generates visible reasoning traces in natural language • comparable to o1 on benchmarks • full technical report and weights publicly available