World Models and Neural Simulators Cheat Sheet

Updated 2026-05-19

World models and neural simulators mark a shift in AI from reactive pattern matching to predictive understanding of how environments evolve over time. Where large language models predict discrete text tokens, world models learn the dynamics of visual, physical, and interactive worlds, so agents can imagine futures, plan actions, and learn from simulated experience instead of costly real-world interaction. At their core, world models compress high-dimensional sensory observations (video, depth, semantics) into compact latent representations and predict how those representations change under actions or over time, in effect building an internal physics engine learned from data. This capability is foundational for robotics (sim-to-real transfer, manipulation planning), autonomous systems (driving simulators, trajectory forecasting), reinforcement learning (sample-efficient policy learning via imagination), and generative AI (controllable video synthesis, interactive 3D environments). A key insight: the challenge is not just generating realistic pixels. It is maintaining temporal coherence, physical plausibility, and causal consistency across long horizons while remaining computationally tractable for real-time planning and control.
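
The encode-then-predict loop described above can be sketched in a few lines. This is a toy illustration only: the class name `TinyLatentWorldModel`, its methods, and the random (untrained) weights are all assumptions made for the example, not any platform's API. A real world model would learn these weights from data and usually add a decoder and reward head.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random matrix scaled by 1/sqrt(cols); a stand-in for learned weights."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class TinyLatentWorldModel:
    """Hypothetical, untrained model: encode once, then roll latents forward."""

    def __init__(self, obs_dim, action_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.w_enc = make_matrix(latent_dim, obs_dim, rng)    # observation -> latent
        self.w_z = make_matrix(latent_dim, latent_dim, rng)   # latent -> next latent
        self.w_a = make_matrix(latent_dim, action_dim, rng)   # action -> next latent

    def encode(self, obs):
        """Compress a high-dimensional observation into a compact latent."""
        return [math.tanh(v) for v in matvec(self.w_enc, obs)]

    def step(self, latent, action):
        """Predict the next latent from the current latent and an action."""
        from_latent = matvec(self.w_z, latent)
        from_action = matvec(self.w_a, action)
        return [math.tanh(z + a) for z, a in zip(from_latent, from_action)]

    def imagine(self, obs, actions):
        """Roll out an action sequence in latent space without touching the environment."""
        latent = self.encode(obs)
        trajectory = [latent]
        for action in actions:
            latent = self.step(latent, action)
            trajectory.append(latent)
        return trajectory


model = TinyLatentWorldModel(obs_dim=64, action_dim=2, latent_dim=8)
imagined = model.imagine(obs=[1.0] * 64, actions=[[0.5, -0.5]] * 5)
print(len(imagined), len(imagined[0]))  # one latent per step, plus the initial one
```

The point of the sketch is that the observation is encoded only once; every later step operates on the 8-dimensional latent, which is what makes imagined rollouts cheap compared with predicting full frames.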

What This Cheat Sheet Covers

This topic spans 20 focused tables and 116 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

Table 1: Foundation World Model Platforms
Table 2: World Model Architectural Components
Table 3: Video Tokenization and Compression
Table 4: Latent Dynamics Modeling Approaches
Table 5: World Model Training Objectives
Table 6: Planning and Control with World Models
Table 7: Sim-to-Real Transfer for Robotics
Table 8: 4D Scene Reconstruction
Table 9: Evaluation Metrics for World Models
Table 10: Temporal Coherence and Consistency Techniques
Table 11: Action Conditioning and Control
Table 12: Object-Centric and Scene Decomposition
Table 13: Self-Supervised Pre-Training and Foundation Models
Table 14: Model Predictive Control and Sampling-Based Planning
Table 15: World Model Architectures for Reinforcement Learning
Table 16: Handling Stochasticity and Partial Observability
Table 17: Exploration and Curiosity in World Model Learning
Table 18: Video Diffusion and Autoregressive Generation
Table 19: Multi-Modal and Language-Conditioned World Models
Table 20: Inference Optimization and Deployment

Table 1: Foundation World Model Platforms

World foundation models (WFMs) are pre-trained, general-purpose simulators designed to predict spatio-temporal dynamics across diverse visual domains. These platforms combine video generation, physics-aware prediction, and action-conditioned rollouts to serve as infrastructure for robotics, autonomous systems, and embodied AI. Unlike task-specific models, WFMs aim for broad generalization and transfer learning.

Platform	Example	Description
NVIDIA Cosmos	`Cosmos-Predict2.5-14B-Video2World` `Cosmos-Transfer2.5-13B` `Cosmos-Reason2`	• Family of world foundation models for physical AI, covering video-to-world generation (Predict), domain adaptation (Transfer), and physical reasoning (Reason). Provides tokenizers, guardrails, and post-training scripts. The newest release (2026) supports multi-view generation and 6-DoF camera control for autonomous vehicle scenarios. • Open-sourced under the NVIDIA Open Model License • Used for synthetic data generation in robotics training • Integrates with NVIDIA Omniverse and Isaac Sim
Google Genie 2/3	`text: "futuristic city"` `→ interactive 3D world @ 24fps`	• Autoregressive world models that generate interactive 3D environments from single images or text prompts. Genie 2 (Dec 2024): autoregressive latent diffusion model producing action-controllable worlds from a single image prompt. Genie 3 (Aug 2025): real-time 720p generation at 24 fps with improved long-term consistency and revisitation memory. • Action-controllable: keyboard/mouse inputs modify generated worlds • Enables prototyping game mechanics without assets • Trained on unlabeled internet video
Dreamer V3	`python train_dreamerv3.py` `--task atari_pong`	• General model-based RL algorithm that uses a Recurrent State-Space Model (RSSM) to learn world dynamics in latent space. Outperforms specialized methods across 150+ tasks (Atari, DM Control, Minecraft diamond collection). • symlog/symexp transforms keep reward and value prediction stable across very different scales • EMA-based normalization for robustness across domains • Imagination training: the policy learns entirely from latent rollouts
Meta WorldGen / World Labs Marble	`image + depth → navigable 3D scene` `persistent across views`	• Spatial intelligence systems that generate persistent 3D worlds from images. Support camera movement, object permanence, and geometric consistency. World Labs' Marble beta (2025): limited-access API for generating and exploring worlds. • Uses panoramic diffusion + mesh reconstruction • Layer-wise depth alignment for multi-scale geometry • Goal: AI-native content creation for games/simulation
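
The symlog/symexp pair listed in the Dreamer V3 row is compact enough to write out. The sketch below uses the definitions from the DreamerV3 paper, symlog(x) = sign(x) · ln(|x| + 1) and its inverse symexp(x) = sign(x) · (exp(|x|) − 1). The function names and the scalar-only interface are choices made for this example; it is not code from the DreamerV3 repository, where the transform is applied to tensors of prediction targets.

```python
import math


def symlog(x):
    """Compress magnitude symmetrically around zero: sign(x) * ln(|x| + 1)."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(x):
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.expm1(abs(x)), x)


# Targets that differ by orders of magnitude land in a narrow range,
# and symexp recovers the original scale from a prediction.
for target in (-1000.0, -1.0, 0.0, 0.5, 1000.0):
    squashed = symlog(target)
    print(target, squashed, symexp(squashed))
```

Because symlog is close to the identity near zero and logarithmic for large magnitudes, one network head can regress small and very large rewards or values without per-domain rescaling; predictions are mapped back with symexp.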
