Modern reasoning-optimized LLMs allocate additional inference compute to deliberate on complex problems before generating answers, a paradigm known as test-time compute scaling. These modelsβpioneered by OpenAI's o1/o3 series and DeepSeek-R1βgenerate extended internal thinking traces during inference, enabling them to solve graduate-level math (AIME), scientific reasoning (GPQA Diamond), and code generation (Codeforces) tasks that stump standard LLMs. Unlike traditional pretraining scaling (more parameters, more data), test-time scaling improves accuracy by investing compute at inference time, turning reasoning into a search problem over possible solution paths. The key insight: spending more time thinking often outperforms adding more training data, opening a new axis of performance improvement orthogonal to model size.
What This Cheat Sheet Covers
This topic spans 14 focused tables and 90 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Reasoning Architectures
Reasoning models learn to generate step-by-step solution paths during training via reinforcement learning signals, then spend variable inference compute exploring and refining those paths at test time. OpenAI's reasoning models hide intermediate thinking tokens from users, while DeepSeek-R1 exposes full reasoning traces, offering interpretability at the cost of privacy.
| Model | Example | Description |
|---|---|---|
AIME 2024: 74.3%GPQA Diamond: 78.3% | First production reasoning model from OpenAI; generates hidden chain-of-thought tokens during inference; trained with RL on verifiable tasks; significantly outperforms GPT-4o on reasoning benchmarks; thinking process not shown to users. | |
ARC-AGI: 87.5% (high)AIME 2024: 96.7% (medium) | Most capable reasoning model as of May 2026; introduces reasoning_effort parameter (low, medium, high) to control compute allocation; scored 135 on Mensa IQ test; sets new SOTA across coding, math, science, and visual perception. | |
Codeforces ELO: 2719SWE-Bench Verified: 68.1% | Cost-efficient reasoning model; slightly outperforms o3 on code tasks; optimized for high throughput with lower latency; adaptive thinking budget balances speed and accuracy; ideal for production deployments. | |
AIME 2024: 79.8%Codeforces: 92nd percentile | Open-weights reasoning model trained via RLVR + GRPO; generates visible reasoning traces in natural language; comparable to o1 on benchmarks; full technical report and weights publicly available. |