Physical AI and Robotics AI Cheat Sheet

Updated 2026-05-18

Physical AI, also called embodied intelligence, is the integration of artificial intelligence into physical systems that perceive, reason about, and interact with the real world. Unlike traditional AI, which processes abstract data, Physical AI covers robots, autonomous vehicles, humanoid systems, and smart devices that must understand physics, manipulate objects, navigate spaces, and collaborate safely with humans. The field grew rapidly in 2025-2026, driven by vision-language-action (VLA) foundation models, world simulators, and sim-to-real transfer breakthroughs. Modern Physical AI systems combine multi-modal perception (vision, tactile, force), foundation model reasoning, and learned motor policies to achieve general-purpose manipulation capabilities that previously required task-specific programming. This shift from narrow automation to generalist robot intelligence changes how machines learn to act in unstructured environments.

What This Cheat Sheet Covers

This cheat sheet spans 16 focused tables and 85 indexed concepts, from foundational concepts through advanced details. The complete table-by-table outline follows.

Table 1: Foundation Models for Robotics
Table 2: World Simulators and Synthetic Data
Table 3: Sim-to-Real Transfer Techniques
Table 4: Visual-Motor Policy Learning
Table 5: Action Representation and Tokenization
Table 6: Reinforcement Learning for Manipulation
Table 7: Imitation Learning and Behavior Cloning
Table 8: Sensor Fusion and Perception
Table 9: Robot Manipulation Primitives
Table 10: Humanoid Robot Systems and Whole-Body Control
Table 11: Benchmark Tasks and Evaluation
Table 12: Hardware Platforms and Edge Computing
Table 13: Robot Navigation and SLAM
Table 14: Training Data and Teleoperation
Table 15: Safety, Robustness, and Human-Robot Interaction
Table 16: Software Frameworks and Middleware

Table 1: Foundation Models for Robotics

Foundation models pre-trained on massive datasets enable robots to generalize across tasks, embodiments, and environments without task-specific retraining. These models, particularly vision-language-action (VLA) architectures, learn from Internet-scale vision-language data, robot trajectory datasets, and simulation, producing policies that can transfer zero-shot to novel manipulation scenarios. The 2025-2026 period saw rapid convergence on VLA as the dominant paradigm, with models scaling from millions to billions of parameters. Unlike classical robotics, which required explicit programming for each task, foundation models learn manipulation strategies from patterns across diverse demonstrations, enabling the emergence of general-purpose robot intelligence.

Model	Example	Description
Gemini Robotics	Bi-arm robots with on-device VLA	Google DeepMind's VLA model for robots of any shape, enabling perception, reasoning, tool use, and human interaction using a SOTA LLM adapted for robotic control without significant architecture changes
OpenVLA	Open-source 7B parameter VLA	Open-weight vision-language-action model with a vision encoder and an action head that supports different robot embodiments; fine-tunable and downloadable for research and commercial use
RT-X	Cross-embodiment policy transfer	Trained on the Open X-Embodiment dataset (1M+ trajectories, 22 robots, 527 skills from 21 institutions); demonstrates positive transfer, where shared learning improves the capabilities of multiple robots
Physical Intelligence π0	π0 0.7 steerable model (April 2026)	General-purpose robot foundation model that learns from Internet-scale vision-language pre-training, open-source datasets, and proprietary dexterous-task data from 8 robots; outputs low-level motor commands directly
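
The table mentions action heads and low-level motor commands without showing how a language-model backbone can emit them. One common approach, used in some VLA models such as RT-2 and OpenVLA, is to discretize each continuous action dimension into a fixed number of bins so that an action becomes a short sequence of tokens. The following is a minimal sketch of that idea, not any model's actual implementation: the function names, the bin count of 256, and the normalized bounds of [-1, 1] are assumptions chosen for illustration.

```python
from typing import List, Sequence

N_BINS = 256  # assumption: 256 uniform bins per action dimension


def tokenize_action(action: Sequence[float],
                    low: Sequence[float],
                    high: Sequence[float],
                    n_bins: int = N_BINS) -> List[int]:
    """Map each continuous action dimension to a bin index in [0, n_bins - 1]."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        clipped = min(max(a, lo), hi)      # clip to the per-dimension bounds
        frac = (clipped - lo) / (hi - lo)  # position within the range, in [0, 1]
        # The upper bound lands exactly on n_bins, so clamp it to the last bin.
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def detokenize_action(tokens: Sequence[int],
                      low: Sequence[float],
                      high: Sequence[float],
                      n_bins: int = N_BINS) -> List[float]:
    """Map bin indices back to continuous values using each bin's center."""
    actions = []
    for t, lo, hi in zip(tokens, low, high):
        width = (hi - lo) / n_bins
        actions.append(lo + (t + 0.5) * width)
    return actions


if __name__ == "__main__":
    # Hypothetical 4-dimensional action with bounds normalized to [-1, 1].
    low, high = [-1.0] * 4, [1.0] * 4
    tokens = tokenize_action([0.0, -1.0, 1.0, 0.5], low, high)
    print(tokens)  # [128, 0, 255, 192]
    print(detokenize_action(tokens, low, high))
```

Decoding from bin centers bounds the round-trip error at half a bin width per dimension (about 0.0039 with these settings). Models that output continuous commands directly avoid this quantization step and trade it for a different action head design.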
