Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning

Jai Bardhan; Patrik Drozdik; Josef Sivic; Vladimir Petrik

arXiv:2603.25685·cs.RO·March 27, 2026

TL;DR

This paper introduces a reinforcement learning-based post-training method to improve the long-term stability and visual fidelity of robot world models, enabling more accurate multi-step video predictions in manipulation tasks.

Contribution

The paper presents an RL post-training scheme, a multi-future comparison protocol, and multi-view fidelity rewards for improving autoregressive robot world models.

Findings

01 Achieved state-of-the-art rollout fidelity on the DROID dataset.

02 Reduced LPIPS by 14% and improved SSIM by 9.1%.

03 Outperformed baselines in human preference studies.

Abstract

Action-conditioned robot world models generate future video frames of the manipulated scene given a robot action sequence, offering a promising alternative for simulating tasks that are difficult to model with traditional physics engines. However, these models are optimized for short-term prediction and break down when deployed autoregressively: each predicted clip feeds back as context for the next, causing errors to compound and visual quality to rapidly degrade. We address this through the following contributions. First, we introduce a reinforcement learning (RL) post-training scheme that trains the world model on its own autoregressive rollouts rather than on ground-truth histories. We achieve this by adapting a recent contrastive RL objective for diffusion models to our setting and show that its convergence guarantees carry over exactly. Second, we design a training protocol that…
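The compounding-error problem described in the abstract can be illustrated with a toy sketch. This is not the paper's model or training method: the scalar "state", the `true_step` and `model_step` functions, and the error size `eps` are all hypothetical stand-ins chosen only to show why feeding predictions back as context behaves differently from conditioning on ground-truth histories.

```python
# Toy illustration (assumed setup, not the paper's model): a scalar state
# stands in for a video frame, and each "clip" is a short run of steps.

def true_step(x, a):
    # Hypothetical ground-truth dynamics.
    return x + a

def model_step(x, a, eps=0.05):
    # Hypothetical imperfect world model: small multiplicative error per step.
    return (1.0 + eps) * x + a

def rollout_clip(step_fn, x0, actions):
    # Predict one clip starting from context state x0.
    xs, x = [], x0
    for a in actions:
        x = step_fn(x, a)
        xs.append(x)
    return xs

def compare(num_clips=5, clip_len=4, x0=1.0, action=0.1):
    actions = [action] * clip_len
    gt_ctx = x0   # ground-truth context for the next clip
    ar_ctx = x0   # autoregressive context: the model's own last prediction
    ar_errors, tf_errors = [], []
    for _ in range(num_clips):
        gt_clip = rollout_clip(true_step, gt_ctx, actions)
        tf_clip = rollout_clip(model_step, gt_ctx, actions)  # ground-truth history
        ar_clip = rollout_clip(model_step, ar_ctx, actions)  # own rollout as history
        tf_errors.append(abs(tf_clip[-1] - gt_clip[-1]))
        ar_errors.append(abs(ar_clip[-1] - gt_clip[-1]))
        gt_ctx = gt_clip[-1]
        ar_ctx = ar_clip[-1]  # predicted clip feeds back as context
    return ar_errors, tf_errors

if __name__ == "__main__":
    ar, tf = compare()
    for k, (e_ar, e_tf) in enumerate(zip(ar, tf), start=1):
        print(f"clip {k}: autoregressive error={e_ar:.4f}, ground-truth-context error={e_tf:.4f}")
```

In this toy setting the two errors coincide on the first clip, after which the autoregressive error keeps growing because each clip inherits the previous clip's error. A model evaluated only with ground-truth context never encounters that accumulated drift, which is the gap that training on the model's own rollouts is meant to close.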

Code & Models

Models

jaibrdhn/persistworld (model on Hugging Face)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Advanced Vision and Imaging