World models and neural simulators mark a shift in AI from reactive pattern matching to predictive understanding of how environments evolve over time. Unlike large language models, which predict discrete tokens, world models learn the continuous dynamics of visual, physical, and interactive worlds, so agents can imagine futures, plan actions, and learn from simulated experience rather than costly real-world interaction. At their core, world models compress high-dimensional sensory observations (video, depth, semantics) into compact latent representations and predict how those representations change under actions or over time, effectively building an internal physics engine learned from data. This capability is foundational for robotics (sim-to-real transfer, manipulation planning), autonomous systems (driving simulators, trajectory forecasting), reinforcement learning (sample-efficient policy learning via imagination), and generative AI (controllable video synthesis, interactive 3D environments). The key challenge is not just generating realistic pixels but maintaining temporal coherence, physical plausibility, and causal consistency across long horizons while remaining computationally tractable for real-time planning and control.
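The compress-then-predict loop described above can be illustrated with a toy sketch. Everything here is hypothetical and hand-written for illustration: `encode` stands in for a learned encoder, `step` for a learned latent dynamics network, and `plan` for a simple planner that compares imagined rollouts. No real world model is this simple, but the control flow (encode once, roll forward in latent space, score the imagined outcomes) has the same shape.

```python
from typing import List, Sequence

Vector = List[float]


def encode(observation: Sequence[float], latent_dim: int = 2) -> Vector:
    """Toy encoder (assumption): average-pool the observation into latent_dim chunks.

    Assumes len(observation) is at least latent_dim.
    """
    chunk = len(observation) // latent_dim
    return [
        sum(observation[i * chunk:(i + 1) * chunk]) / chunk
        for i in range(latent_dim)
    ]


def step(latent: Vector, action: float) -> Vector:
    """Toy latent dynamics (assumption): a damped position/velocity update.

    A real world model would replace this with a learned network.
    """
    position, velocity = latent
    velocity = 0.9 * velocity + 0.1 * action
    return [position + velocity, velocity]


def imagine(latent: Vector, actions: Sequence[float]) -> List[Vector]:
    """Roll the dynamics forward in latent space without touching the real environment."""
    trajectory = [latent]
    for action in actions:
        latent = step(latent, action)
        trajectory.append(latent)
    return trajectory


def plan(latent: Vector, candidates: Sequence[Sequence[float]], goal: float) -> Sequence[float]:
    """Return the candidate action sequence whose imagined final position is closest to goal."""
    def cost(actions: Sequence[float]) -> float:
        return abs(imagine(latent, actions)[-1][0] - goal)

    return min(candidates, key=cost)


if __name__ == "__main__":
    z0 = encode([0.0, 0.0, 0.0, 0.0])
    options = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [-1.0, -1.0, -1.0]]
    print("chosen actions:", plan(z0, options, goal=0.5))
```

The planner never queries the environment after the initial observation. That is the property that makes imagination-based planning sample-efficient, at the cost of depending entirely on how accurate the dynamics model is.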
What This Cheat Sheet Covers
This topic spans 20 focused tables and 116 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Foundation World Model Platforms
World foundation models (WFMs) are pre-trained, general-purpose simulators designed to predict spatio-temporal dynamics across diverse visual domains. These platforms combine video generation, physics-aware prediction, and action-conditioned rollouts to serve as infrastructure for robotics, autonomous systems, and embodied AI. Unlike task-specific models, WFMs aim for broad generalization and transfer learning.
| Platform | Example | Description |
|---|---|---|
| NVIDIA Cosmos | `Cosmos-Predict2.5-14B-Video2World`, `Cosmos-Transfer2.5-13B`, `Cosmos-Reason2` | Family of world foundation models for physical AI, covering video-to-world generation (Predict), domain adaptation (Transfer), and physical reasoning (Reason). Provides tokenizers, guardrails, and post-training scripts. The newest release (2026) supports multi-view generation and 6-DoF camera control for autonomous-vehicle scenarios. • Open-sourced under the NVIDIA Open Model License • Used for synthetic data generation in robotics training • Integrates with NVIDIA Omniverse and Isaac Sim |
| Genie 2 / Genie 3 | `text: "futuristic city" → interactive 3D world @ 24fps` | Autoregressive latent diffusion world model generating interactive 3D environments from single images or text prompts. Genie 2 (Dec 2024): playable worlds generated from a single image prompt. Genie 3 (Aug 2025): real-time 720p generation at 24 fps, improved long-term consistency, revisitation memory. • Action-controllable: keyboard/mouse inputs modify generated worlds • Enables prototyping game mechanics without assets • Trained on unlabeled internet video |
| DreamerV3 | `python train_dreamerv3.py --task atari_pong` | General model-based RL algorithm that uses a Recurrent State-Space Model (RSSM) to learn world dynamics in latent space. Outperforms specialized methods across 150+ tasks (Atari, DM Control, Minecraft diamond collection). • Symlog/symexp transform for stable value prediction • EMA normalization for robustness across domains • Imagination training: the policy learns entirely in latent rollouts |
| Marble (World Labs) | `image + depth → navigable 3D scene, persistent across views` | Spatial intelligence system generating persistent 3D worlds from images. Supports camera movement, object permanence, and geometric consistency. Marble beta (2025): limited-access API for generating and exploring worlds. • Uses panoramic diffusion + mesh reconstruction • Layer-wise depth alignment for multi-scale geometry • Goal: AI-native content creation for games/simulation |
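The symlog/symexp pair listed for DreamerV3 in the table above is small enough to write out. The sketch below uses the standard definitions, symlog(x) = sign(x) · ln(|x| + 1) and its inverse symexp(x) = sign(x) · (exp(|x|) − 1), in plain Python. The function names are ours for illustration and are not taken from any particular codebase.

```python
import math


def symlog(x: float) -> float:
    """Compress magnitude symmetrically: sign(x) * ln(|x| + 1)."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(x: float) -> float:
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.expm1(abs(x)), x)


if __name__ == "__main__":
    # Rewards spanning several orders of magnitude end up in a narrow range.
    for reward in (-1000.0, -1.0, 0.0, 0.5, 1000.0):
        squashed = symlog(reward)
        print(f"reward={reward:>8}  symlog={squashed:+.4f}  roundtrip={symexp(squashed):+.4f}")
```

Because symlog is roughly linear near zero and logarithmic for large magnitudes, a network can regress onto `symlog(target)` and decode with `symexp`. This keeps targets on a similar scale across domains with very different reward ranges, without per-task tuning.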