Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning
Jai Bardhan, Patrik Drozdik, Josef Sivic, Vladimir Petrik

TL;DR
This paper introduces a reinforcement learning-based post-training method to improve the long-term stability and visual fidelity of robot world models, enabling more accurate multi-step video predictions in manipulation tasks.
Contribution
It presents a novel RL training scheme, a multi-future comparison protocol, and multi-view fidelity rewards to enhance autoregressive robot world models.
Findings
Achieved state-of-the-art rollout fidelity on the DROID dataset.
Reduced LPIPS by 14% and improved SSIM by 9.1%.
Outperformed baselines in human preference studies.
Abstract
Action-conditioned robot world models generate future video frames of the manipulated scene given a robot action sequence, offering a promising alternative for simulating tasks that are difficult to model with traditional physics engines. However, these models are optimized for short-term prediction and break down when deployed autoregressively: each predicted clip feeds back as context for the next, causing errors to compound and visual quality to rapidly degrade. We address this through the following contributions. First, we introduce a reinforcement learning (RL) post-training scheme that trains the world model on its own autoregressive rollouts rather than on ground-truth histories. We achieve this by adapting a recent contrastive RL objective for diffusion models to our setting and show that its convergence guarantees carry over exactly. Second, we design a training protocol that…
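The compounding effect described in the abstract can be illustrated with a toy sketch. The scalar dynamics, the `model_step` function with its 2% bias, and every name below are assumptions made up for illustration; none of this is the paper's model or its RL objective. The sketch only contrasts one-step prediction from ground-truth history with an autoregressive rollout that feeds each prediction back as context.

```python
from typing import Callable, List

Step = Callable[[float, float], float]


def true_step(state: float, action: float) -> float:
    """Hypothetical ground-truth dynamics: the action is added to the state."""
    return state + action


def model_step(state: float, action: float) -> float:
    """Hypothetical learned model with a small multiplicative bias (assumed 2%)."""
    return 1.02 * state + action


def simulate_truth(initial: float, actions: List[float]) -> List[float]:
    """Ground-truth trajectory, including the initial state."""
    states = [initial]
    for a in actions:
        states.append(true_step(states[-1], a))
    return states


def teacher_forced_errors(states: List[float], actions: List[float], model: Step) -> List[float]:
    """One-step errors when the model always sees the ground-truth previous state."""
    return [abs(model(states[t], a) - states[t + 1]) for t, a in enumerate(actions)]


def autoregressive_errors(states: List[float], actions: List[float], model: Step) -> List[float]:
    """Errors when each prediction is fed back as the context for the next step."""
    pred = states[0]
    errors = []
    for t, a in enumerate(actions):
        pred = model(pred, a)  # context is the model's own previous output
        errors.append(abs(pred - states[t + 1]))
    return errors


actions = [0.1] * 20
states = simulate_truth(1.0, actions)
tf_errors = teacher_forced_errors(states, actions, model_step)
ar_errors = autoregressive_errors(states, actions, model_step)

print("final teacher-forced error:", tf_errors[-1])
print("final autoregressive error:", ar_errors[-1])
```

In this toy setting the two error curves agree at the first step, and the autoregressive error grows faster afterwards because each step inherits the error of the previous one.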
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Advanced Vision and Imaging
