RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models

Hao Wu; Yuqi Li; Yuan Gao; Fan Xu; Fan Zhang; Kun Wang; Penghao Zhao; Qiufeng Wang; Yizhou Zhao; Weiyan Wang; Yingli Tian; Xian Wu; Xiaomeng Huang

arXiv:2605.03821·cs.RO·May 6, 2026

TL;DR

RoboAlign-R1 improves robot video world models through reward-aligned post-training and stabilized long-horizon inference, yielding better task performance and realism.

Contribution

The paper introduces a reward-aligned post-training framework with a stabilized long-horizon inference strategy, and constructs RobotWorldBench, a benchmark of 10,000 annotated video-instruction pairs, for evaluation.

Findings

1. 10.1% improvement in aggregate six-dimension score
2. 7.5% gain in Manipulation Accuracy
3. SWR increases long-horizon prediction quality with minimal latency

Abstract

Existing robot video world models are typically trained with low-level objectives such as reconstruction and perceptual similarity, which are poorly aligned with the capabilities that matter most for robot decision making, including instruction following, manipulation success, and physical plausibility. They also suffer from error accumulation in long-horizon autoregressive prediction. We present RoboAlign-R1, a framework that combines reward-aligned post-training with stabilized long-horizon inference for robot video world models. We construct RobotWorldBench, a benchmark of 10,000 annotated video-instruction pairs collected from four robot data sources, and train a multimodal teacher judge, RoboAlign-Judge, to provide fine-grained six-dimensional evaluation of generated videos. We then distill the teacher into a lightweight student reward model for efficient…
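
The abstract does not state the student's architecture or the distillation loss, so the following is only a minimal sketch of the general idea of distilling a six-dimensional teacher judge into a cheaper student. It assumes a linear student over hypothetical precomputed video features, a mean-squared-error objective, and a plain mean as the aggregate reward; the names `student_predict`, `distill`, and `aggregate` are illustrative and not taken from the paper.

```python
import random

NUM_DIMS = 6  # six evaluation dimensions, as in the abstract (not enumerated there)


def student_predict(weights, features):
    """Linear student (an assumption): one weight vector per score dimension."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def distill(samples, num_features, lr=0.05, epochs=200, seed=0):
    """Fit the student to teacher scores with per-sample gradient steps on squared error.

    samples: list of (features, teacher_scores) pairs, where teacher_scores
    holds NUM_DIMS values produced by a (hypothetical) teacher judge.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(num_features)]
               for _ in range(NUM_DIMS)]
    for _ in range(epochs):
        for features, teacher_scores in samples:
            preds = student_predict(weights, features)
            for d in range(NUM_DIMS):
                err = preds[d] - teacher_scores[d]
                for j in range(num_features):
                    # Gradient of 0.5 * err**2 with respect to weights[d][j]
                    weights[d][j] -= lr * err * features[j]
    return weights


def aggregate(scores):
    """Collapse per-dimension scores into one scalar reward (plain mean, an assumption)."""
    return sum(scores) / len(scores)
```

In this sketch the student only sees the teacher's scores, never human labels, which is what makes it a distillation step; a scalar such as `aggregate(student_predict(weights, features))` could then serve as a cheap reward signal during post-training.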
