EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards
Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, Kui Jia

TL;DR
This paper introduces EVA, a reinforcement learning framework that aligns video world models with executable robot actions by using inverse dynamics as a reward, improving the physical plausibility and task success of generated videos.
Contribution
EVA uses an inverse dynamics model as the reward signal for training video world models, reducing embodiment artifacts in generated videos and improving task performance on real robots.
Findings
EVA reduces embodiment artifacts in generated videos.
EVA improves task success rates on RoboTwin and real robots.
EVA aligns video world models with physical constraints, so that generated rollouts correspond to executable robot actions.
Abstract
Video generative models are increasingly used as world models for robotics, where a model generates a future visual rollout conditioned on the current observation and task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. However, current video world models lack explicit executability constraints. As a result, visually coherent rollouts may still violate rigid-body and kinematic consistency, producing unstable or infeasible control commands when decoded by an IDM. We refer to this mismatch between visual generation and physically executable control as the executability gap. While this gap can be mitigated at inference time using techniques such as rejection sampling, such approaches are inefficient due to the high cost of video generation. In this paper, we leverage the executability gap as a training signal and introduce…
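The pipeline the abstract describes (generate a rollout, decode it with an IDM, judge whether the decoded actions are executable) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `toy_idm`, `executability_reward`, `rejection_sample`, the `max_step` limit, and the smoothness penalty are hypothetical and are not taken from the paper, whose actual reward is not specified in this summary. Frames are stood in for by low-dimensional state vectors so the example runs without a video model.

```python
import numpy as np


def toy_idm(frames: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for an inverse dynamics model.

    Maps T 'frames' (here: T state vectors of shape (T, D)) to T-1 actions,
    taken as the difference between consecutive states.
    """
    return np.diff(frames, axis=0)


def executability_reward(actions: np.ndarray,
                         max_step: float = 0.1,
                         smooth_weight: float = 1.0) -> float:
    """Illustrative reward (an assumption, not the paper's definition).

    Penalizes (a) action magnitudes above a per-step limit and
    (b) abrupt changes between consecutive actions.
    """
    limit_violation = np.clip(np.abs(actions) - max_step, 0.0, None).sum()
    if len(actions) > 1:
        abruptness = np.abs(np.diff(actions, axis=0)).sum()
    else:
        abruptness = 0.0
    return -float(limit_violation + smooth_weight * abruptness)


def rejection_sample(rollouts, reward_fn):
    """Inference-time baseline: score every candidate rollout, keep the best.

    Each candidate would require a full video generation in practice,
    which is the cost the abstract calls inefficient.
    """
    rewards = [reward_fn(toy_idm(r)) for r in rollouts]
    return int(np.argmax(rewards)), rewards


# Two candidate rollouts of 21 steps in a 2-D state space.
steps = np.linspace(0.0, 1.0, 21)[:, None] * np.ones((1, 2))
smooth_rollout = steps                                   # steady motion
sign = ((-1.0) ** np.arange(21))[:, None]
jittery_rollout = steps + 0.2 * sign                     # oscillating artifact

best_index, rewards = rejection_sample(
    [smooth_rollout, jittery_rollout], executability_reward
)
print(best_index, rewards)
```

In this toy setup the oscillating rollout decodes to large, rapidly alternating actions and scores far lower than the steady one. Rejection sampling applies such a score only after generation; using it as a training reward, as the abstract proposes, would instead push the generator toward rollouts that score well in the first place.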
Taxonomy
Topics: Robot Manipulation and Learning · Social Robot Interaction and HRI · Human Motion and Animation
