World Modeling for Physical AI
From simulation to representation learning, predictive control, and beyond
World models give AI systems the ability to predict how the world changes in response to actions. Interest in these models is surging across AI research, especially in relation to systems that must act in the physical world. Autonomous driving, the most mature domain of Physical AI, already uses forms of world modeling in practice, but the field remains fragmented: some systems model traffic behavior, while others generate photorealistic video, and there is a constant stream of new research exploring novel ways of extracting practical value from these models. In this post, we take a high-level look at the current state of world modeling in autonomous driving, highlighting the gaps between deployed systems, active research, and longer-term ideas.
Looking Back
Autonomous vehicle deployments already rely on a practical form of world modeling, which looks less like open-ended generation and more like classical simulation: reconstructing logged environments, perturbing traffic behavior, and using the resulting scenarios to train or evaluate driving policies.
World Engine is a representative example of this paradigm. It uses 3D Gaussian Splatting (3DGS) to reconstruct driving scenes from logs and, in parallel, trains a base driving policy through Imitation Learning (IL) pre-training on large-scale expert demonstrations. Next, a “behavior world model” generates rare safety-critical interactions such as cut-ins and near-misses. These generated scenarios support Reinforcement Learning (RL) post-training, which helped the resulting policy complete 200 kilometers of real-world driving in Shanghai without any manual interventions. Scenarios encountered during the real-world test included construction zones, occluded pedestrians, rainy weather, and poor lighting, all handled seamlessly by the policy learned end-to-end.
In such work, however, the world model plays a limited role. First, it only contributes to RL post-training, while IL pre-training remains the main lever for improving the driving policy. This keeps the overall system dependent on collecting and curating millions of hours of diverse driving recordings. Second, it remains constrained to an abstract behavioral space, generating entities in the form of bounding boxes and trajectories rather than detailed perceptual observations.
The World Engine simulation stack is also expensive: it requires millions of GPU hours, many of them spent on 3DGS reconstruction. Despite this investment, the resulting RL stage can make only relatively small adjustments to the policy. The core bottleneck is the limited physical extent of reconstruction-based simulation. Reconstructions remain tied to logged scenes and local geometry, typically extending only a few meters around the ego vehicle. This short spatial extent leads to short temporal rollouts, which, combined with the difficulty of designing RL reward functions, limits the RL signal and reinforces dependence on IL data. Overall, 3DGS is a meaningful step forward from the simpler geometric transformations used in prior deployments, but it falls short of the broader promise of world models: generating controllable, action-conditioned worlds rather than perturbing recorded ones.
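To make the link between spatial extent and rollout length concrete, here is a back-of-the-envelope sketch. It assumes a deliberately crude model that is not taken from World Engine: the simulated ego drifts away from the logged trajectory at a constant speed, and the rollout stops being valid once it leaves the reconstructed region. The function name and all numbers are illustrative.

```python
def max_rollout_seconds(valid_extent_m: float, drift_speed_mps: float) -> float:
    """Crude upper bound on rollout duration.

    valid_extent_m: how far the ego may deviate from the logged pose before
        the reconstruction stops being usable (assumed value).
    drift_speed_mps: speed at which the simulated ego moves away from the
        logged trajectory (assumed constant).
    """
    if valid_extent_m < 0 or drift_speed_mps <= 0:
        raise ValueError("extent must be non-negative and drift speed positive")
    return valid_extent_m / drift_speed_mps


# Illustrative numbers only: a few meters of valid extent, modest deviations.
for extent_m in (2.0, 5.0):
    for drift_mps in (0.5, 2.5):
        horizon_s = max_rollout_seconds(extent_m, drift_mps)
        print(f"extent={extent_m} m, drift={drift_mps} m/s -> {horizon_s:.1f} s")
```

Under these assumptions, any policy that behaves noticeably differently from the logged driver exhausts the valid region within seconds, which is one way to see why the RL signal stays small.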
Looking Around
The field is now moving beyond reconstructing short logged fragments toward generative world models that synthesize longer, controllable rollouts of plausible sensor data. This shift is already visible across three major autonomous driving organizations that have each recently unveiled production-scale generative simulators: NVIDIA, Wayve and Waymo.
Structurally, these systems share a hierarchical design, similar to World Engine. First, a traffic scenario generator handles agent layout, scene structure, and long-horizon behavior. Second, a diffusion transformer turns this structured state into high-fidelity surround-view sensor observations conditioned on an action sequence.
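The two-stage structure can be sketched as a toy interface. This is only an illustration under heavy assumptions: the class and function names are hypothetical, the "traffic generator" is a constant-velocity update, and the "renderer" emits a 1-D occupancy grid instead of camera frames produced by a diffusion transformer.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Agent:
    x: float  # longitudinal position in meters
    v: float  # longitudinal speed in m/s


@dataclass(frozen=True)
class Scene:
    ego: Agent
    others: Tuple[Agent, ...]


def step_scene(scene: Scene, ego_accel: float, dt: float) -> Scene:
    """Stage 1 stand-in: update the structured state.
    The ego follows the commanded acceleration; other agents keep their speed."""
    ego_v = max(0.0, scene.ego.v + ego_accel * dt)
    ego = Agent(x=scene.ego.x + ego_v * dt, v=ego_v)
    others = tuple(Agent(x=a.x + a.v * dt, v=a.v) for a in scene.others)
    return Scene(ego=ego, others=others)


def render(scene: Scene, cells: int = 10, cell_m: float = 5.0) -> List[int]:
    """Stage 2 stand-in: turn structured state into an observation.
    Here the observation is an occupancy grid ahead of the ego."""
    grid = [0] * cells
    for agent in scene.others:
        idx = int((agent.x - scene.ego.x) // cell_m)
        if 0 <= idx < cells:
            grid[idx] = 1
    return grid


def rollout(scene: Scene, ego_accels: Sequence[float], dt: float = 0.5) -> List[List[int]]:
    """Action-conditioned rollout: one observation per ego action."""
    frames = []
    for accel in ego_accels:
        scene = step_scene(scene, accel, dt)
        frames.append(render(scene))
    return frames


start = Scene(ego=Agent(x=0.0, v=10.0), others=(Agent(x=20.0, v=10.0),))
print(rollout(start, [0.0, 0.0]))  # ego holds speed
print(rollout(start, [2.0, 2.0]))  # ego accelerates and closes the gap
```

The point of the split is that long-horizon behavior lives in the cheap structured state, while the expensive generative model only has to answer "what would the sensors see in this state".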
All three systems point toward the same target: using world models for closed-loop training in simulation. While runtime details are unavailable for Wayve's GAIA-3 and Waymo's WaymoWM, NVIDIA's OmniDreams achieves an impressive 12 FPS on a single GPU, suggesting that this target may be technically plausible in the near term. If these models become fast and stable enough, the familiar IL-to-RL pipeline can move from abstract traffic simulation into pixel-level simulators: IL pre-training on large-scale driving logs, followed by RL post-training inside the world model.
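As a rough way to read the 12 FPS figure, the sketch below converts a generation rate into simulated driving time per GPU-day. The simulation rates (10 Hz and 30 Hz) are my assumptions, not reported numbers, and the calculation ignores policy inference, reward computation, and whether a "frame" covers one camera or the full surround view.

```python
def realtime_factor(generated_fps: float, sim_rate_hz: float) -> float:
    """Simulated seconds produced per wall-clock second."""
    if generated_fps <= 0 or sim_rate_hz <= 0:
        raise ValueError("rates must be positive")
    return generated_fps / sim_rate_hz


def simulated_hours_per_gpu_day(generated_fps: float, sim_rate_hz: float) -> float:
    """Hours of simulated driving one GPU produces in 24 wall-clock hours."""
    return 24.0 * realtime_factor(generated_fps, sim_rate_hz)


fps = 12.0  # generation rate quoted in the text
for sim_rate_hz in (10.0, 30.0):  # assumed simulation rates
    factor = realtime_factor(fps, sim_rate_hz)
    hours = simulated_hours_per_gpu_day(fps, sim_rate_hz)
    print(f"{sim_rate_hz:.0f} Hz sim: {factor:.2f}x real time, {hours:.1f} sim hours per GPU-day")
```

Under these assumptions a single GPU sits near real time, so closed-loop RL at scale would still depend on running many such simulators in parallel.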
However, these systems still need to improve along several axes, including visual fidelity. Training them remains compute- and data-intensive, and their usefulness for policy improvement depends on their physical consistency, controllability, interaction latency, and rendering throughput. Policies trained in simulators will also need to cope with the sim-to-real gap when deployed in the real world.
Looking Ahead
The systems above extend today’s most direct path: scaling pixel-level generative simulators. A complementary approach is to make the “latent state” encoded in these world models more useful for prediction and control.
Many current world models use encoders primarily for visual compression. For Physical AI, the more useful representation could be a structured latent space that captures geometry, motion, affordances, intent, and causal relationships. Learning such a representation is one of the main challenges in creating more useful world models. I believe that using other foundation models to auto-label world model training data at scale could help bridge this gap.
Another direction I am particularly excited about is scaling up latent world models that are trained together with value functions. Instead of generating future observations, these models learn compact latent dynamics together with rewards, values, and sometimes policies. This makes them more than simulators: the same learned representation can also support predictive control at deployment time, as well as perception or data curation tasks. What distinguishes this direction from prior IL-to-RL approaches is that RL can now shape the world model while it is being trained. Temporal difference learning pushes the latent space toward capturing task-relevant quantities, instead of treating the world model as a frozen simulator used only after training.
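A minimal sketch of what "RL shaping the world model" can mean in practice: the encoder, dynamics, reward head, and value head are trained on one joint objective, so the temporal difference error sends gradient back into the encoder. Everything here is a toy assumption of mine (scalar linear components, a frozen target copy for the TD target, finite differences instead of autodiff); it is not the architecture of any system discussed in this post.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LatentModel:
    w_enc: float = 0.5  # encoder:   z = w_enc * obs
    w_dyn: float = 1.0  # dynamics:  z' = w_dyn * z + w_act * a
    w_act: float = 0.1
    w_rew: float = 0.0  # reward head: r_hat = w_rew * z
    w_val: float = 1.0  # value head:  V(z) = w_val * z

    def encode(self, obs): return self.w_enc * obs
    def predict_next(self, z, a): return self.w_dyn * z + self.w_act * a
    def reward(self, z): return self.w_rew * z
    def value(self, z): return self.w_val * z


def joint_loss(model, target, obs, action, reward, next_obs, gamma=0.99):
    """Per-transition loss terms. `target` is a frozen copy used for targets."""
    z = model.encode(obs)
    z_next = target.encode(next_obs)                 # latent consistency target
    td_target = reward + gamma * target.value(z_next)
    return {
        "dynamics": (model.predict_next(z, action) - z_next) ** 2,
        "reward": (model.reward(z) - reward) ** 2,
        "value": (model.value(z) - td_target) ** 2,  # TD error
    }


def value_grad_wrt_encoder(model, target, batch, eps=1e-6):
    """Central finite difference of the TD loss with respect to w_enc."""
    hi = joint_loss(replace(model, w_enc=model.w_enc + eps), target, *batch)["value"]
    lo = joint_loss(replace(model, w_enc=model.w_enc - eps), target, *batch)["value"]
    return (hi - lo) / (2 * eps)


model = LatentModel()
target = LatentModel()              # frozen copy of the same parameters
batch = (2.0, 1.0, 1.0, 2.2)        # (obs, action, reward, next_obs), made up
print(joint_loss(model, target, *batch))
print(value_grad_wrt_encoder(model, target, batch))  # non-zero: TD reaches the encoder
```

The non-zero gradient on `w_enc` is the whole point: the representation is pulled toward whatever helps predict value, rather than only toward whatever helps reconstruct or predict observations.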
For reliable simulation of latent features, such a model must be trained on diverse and heterogeneous data, including pre-recorded datasets as well as data collected online through interaction in simulated environments. Importantly, by then distilling or fine-tuning the model on a task-specific distribution, the same architecture used for simulation can be deployed as a controller on the physical system, either through model predictive control or a learned policy head.
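To illustrate the model predictive control option, here is a random-shooting planner running on a toy latent model. The dynamics, reward, and all parameters are hypothetical stand-ins for learned components, and random shooting is just the simplest planner to write down; a real controller would likely use a stronger optimizer and a learned value function to bootstrap beyond the horizon.

```python
import random


def dynamics(z: float, a: float) -> float:
    """Stand-in for a learned latent dynamics model."""
    return z + 0.1 * a


def reward(z: float, goal: float) -> float:
    """Stand-in for a learned reward head: penalize distance to the goal."""
    return -(z - goal) ** 2


def plan(z0: float, goal: float, rng: random.Random,
         horizon: int = 5, n_samples: int = 256) -> float:
    """Random shooting: sample action sequences, imagine them in latent
    space, and return the first action of the best-scoring sequence."""
    best_return, best_first = float("-inf"), 0.0
    for _ in range(n_samples):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        z, total = z0, 0.0
        for a in actions:
            z = dynamics(z, a)
            total += reward(z, goal)
        if total > best_return:
            best_return, best_first = total, actions[0]
    return best_first


def run_closed_loop(z0: float, goal: float, steps: int, seed: int = 0) -> float:
    """Replan at every step and apply only the first planned action."""
    rng = random.Random(seed)
    z = z0
    for _ in range(steps):
        z = dynamics(z, plan(z, goal, rng))
    return z


final_z = run_closed_loop(z0=0.0, goal=1.0, steps=30)
print(f"final latent state: {final_z:.3f} (goal 1.0)")
```

Note that the planner never renders an observation: all imagination happens in the latent space, which is what makes this kind of control cheap enough to consider at deployment time.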
Illustrated below is an example architecture incorporating these ideas. Stay tuned for more details and a first implementation!









