ReWorld: Multi-Dimensional Reward Modeling for Embodied World Models
Baorui Peng, Wenyao Zhang, Liang Xu, Zekun Qi, Jiazhao Zhang, Hongsi Liu, Wenjun Zeng, Xin Jin

TL;DR
ReWorld introduces a reinforcement learning framework that enhances video-based embodied world models by aligning them with physical realism, task performance, and visual quality through a large-scale human preference dataset and a hierarchical reward model.
Contribution
The paper presents ReWorld, a novel framework that employs multi-dimensional reward modeling and reinforcement learning to improve physical fidelity, logical coherence, and visual quality in embodied world models.
Findings
ReWorld significantly improves physical realism and task performance.
The hierarchical reward model effectively captures human preferences.
ReWorld outperforms previous methods in various evaluations.
Abstract
Video-based world models that learn to simulate dynamics have recently gained increasing attention in robot learning. However, current approaches primarily emphasize visual generative quality while overlooking physical fidelity, dynamic consistency, and task logic, especially in contact-rich manipulation, which limits their applicability to downstream tasks. To this end, we introduce ReWorld, a framework that uses reinforcement learning to align video-based embodied world models with physical realism, task completion capability, embodiment plausibility, and visual quality. Specifically, we first construct a large-scale (~235K) video preference dataset and use it to train a hierarchical reward model designed to capture multi-dimensional rewards consistent with human preferences. We further propose a practical alignment algorithm that post-trains flow-based world…
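The abstract does not state the reward model's architecture or training loss. As a rough illustration only, the sketch below shows one common way to train a reward on pairwise human preferences: a Bradley-Terry objective over a scalar built from per-dimension scores. The Bradley-Terry loss, the weighted-mean aggregation, and the names `DIMENSIONS`, `aggregate`, and `preference_loss` are all assumptions made for this example and are not taken from the paper.

```python
import math

# Hypothetical dimension names, borrowed from the alignment targets listed in the abstract.
DIMENSIONS = (
    "physical_realism",
    "task_completion",
    "embodiment_plausibility",
    "visual_quality",
)


def aggregate(scores, weights=None):
    """Collapse per-dimension scores into one scalar reward (weighted mean).

    Assumption: the paper's hierarchical model may combine dimensions differently.
    """
    if weights is None:
        weights = {d: 1.0 for d in DIMENSIONS}
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total


def preference_loss(preferred, rejected):
    """Bradley-Terry negative log-likelihood that `preferred` beats `rejected`."""
    margin = aggregate(preferred) - aggregate(rejected)
    # -log(sigmoid(margin)) computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Under these assumptions, the loss shrinks as the reward model scores the human-preferred video higher than the rejected one, and equals log 2 when the two videos receive the same aggregate score.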
Taxonomy
Topics: Social Robot Interaction and HRI · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
