EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields
Zhaoyang Yang, Yurun Jin, Lizhe Qi, Cong Huang, Kai Chen

TL;DR
EA-WM is a generative world model that couples kinematic control with visual perception: it projects robot actions into the camera view, aiming for more accurate robot simulation and interaction modeling.
Contribution
It introduces Structured Kinematic-to-Visual Action Fields and event-aware fusion blocks to improve robot-world interaction modeling in video generation.
Findings
EA-WM achieves state-of-the-art results on the WorldArena benchmark.
It better preserves robot spatial geometry and interaction dynamics in generated videos.
The model effectively leverages action signals to guide video synthesis.
Abstract
Pretrained video diffusion models provide powerful spatiotemporal generative priors, making them a natural foundation for robotic world models. While recent world-action models jointly optimize future videos and actions, they predominantly treat video generation as an auxiliary representation for policy learning. Consequently, they insufficiently explore the inverse problem: leveraging action signals to guide video synthesis, thereby often failing to preserve precise robot spatial geometry and fine-grained robot-object interaction dynamics in the generated rollouts. To bridge this gap, we present EA-WM, an Event-Aware Generative World Model that effectively closes the loop between kinematic control and visual perception. Rather than injecting joint or end-effector actions as abstract, low-dimensional tokens, EA-WM projects actions and kinematic states directly into the target camera…
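To make the idea of projecting actions into the camera view concrete, here is a minimal sketch. It is not the authors' implementation: the truncated abstract does not specify how the action fields are built, so this assumes a standard pinhole camera with known intrinsics `K` and a world-to-camera transform `T_cam_world`, and renders each projected 3D keypoint (for example an end-effector position) as a Gaussian blob. The function names `project_points` and `rasterize_action_field`, the blob rendering, and all numeric values are hypothetical.

```python
import numpy as np


def project_points(points_world, K, T_cam_world):
    """Project Nx3 world points to pixel coordinates with a pinhole model.

    Returns (uv, depth). Points at or behind the camera plane get NaN pixels.
    """
    pts = np.asarray(points_world, dtype=float)
    pts_h = np.hstack([pts, np.ones((pts.shape[0], 1))])  # homogeneous coords
    pts_cam = (T_cam_world @ pts_h.T).T[:, :3]            # world -> camera frame
    depth = pts_cam[:, 2]
    uv_h = (K @ pts_cam.T).T                               # apply intrinsics
    safe_depth = np.where(depth > 0, depth, np.nan)        # avoid dividing by <= 0
    uv = uv_h[:, :2] / safe_depth[:, None]
    return uv, depth


def rasterize_action_field(uv, depth, height, width, sigma=2.0):
    """Render projected keypoints as Gaussian blobs on an HxW image plane.

    Assumption for illustration only: one channel, max-combined blobs.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    field = np.zeros((height, width))
    for (u, v), z in zip(uv, depth):
        if not z > 0:  # skip points that are not in front of the camera
            continue
        blob = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))
        field = np.maximum(field, blob)
    return field


# Hypothetical camera: focal length 100 px, principal point (32, 24).
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
T_cam_world = np.eye(4)  # camera frame coincides with the world frame

keypoints = [[0.0, 0.0, 1.0], [0.1, -0.05, 2.0]]  # made-up 3D positions
uv, depth = project_points(keypoints, K, T_cam_world)
field = rasterize_action_field(uv, depth, height=48, width=64)
print(uv[0])  # the first point lies on the optical axis, so it maps to (32, 24)
```

A spatial map like this is aligned pixel-for-pixel with the video frames, which is the contrast the abstract draws with injecting actions as abstract, low-dimensional tokens.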
