RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

Feng Jiang; Yang Chen; Kyle Xu; Yuchen Liu; Haifeng Wang; Zhenhao Shen; Jasper Lu; Shengze Huang; Yuanfei Wang; Chen Xie; Ruihai Wu

arXiv:2604.19092·cs.RO·May 15, 2026

TL;DR

RoboWM-Bench is a new benchmark that evaluates whether predicted manipulation videos are physically executable by translating them into action sequences and running those actions in simulation.

Contribution

The paper introduces a manipulation-centric benchmark that assesses whether generated videos can be converted into executable actions in physically grounded simulations.
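As a rough illustration of that idea, the sketch below scores a set of generated videos by the fraction whose extracted actions complete the task in simulation. It is only a minimal sketch under stated assumptions: `Episode`, `video_to_actions`, `run_in_sim`, and `execution_success_rate` are hypothetical names introduced here, not the benchmark's actual API, and the metric shown is a plain success rate that the paper may or may not use in this form.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical action type: e.g. an end-effector command as a tuple of floats.
Action = Tuple[float, ...]


@dataclass
class Episode:
    """One evaluation item: a task name plus a generated video (opaque here)."""
    task: str
    video: Any


def execution_success_rate(
    episodes: Sequence[Episode],
    video_to_actions: Callable[[Any], List[Action]],
    run_in_sim: Callable[[str, List[Action]], bool],
) -> float:
    """Return the fraction of videos whose extracted actions succeed in simulation.

    Both callables are assumptions supplied by the caller:
    - video_to_actions turns a generated video into an action sequence;
    - run_in_sim executes the actions for a task and reports success.
    """
    if not episodes:
        raise ValueError("no episodes to evaluate")
    successes = 0
    for ep in episodes:
        actions = video_to_actions(ep.video)
        if run_in_sim(ep.task, actions):
            successes += 1
    return successes / len(episodes)
```

Passing the two stages in as callables keeps the scoring loop independent of any particular video-to-action method or simulator, which is the separation the benchmark description implies.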

Findings

01. Visual plausibility does not always imply executability.

02. Factors like spatial reasoning and contact prediction significantly impact execution success.

03. The benchmark reveals gaps in current video world models for robotic manipulation.

Abstract

Recent advances in large-scale video world models have enabled increasingly realistic future prediction, raising the prospect of using generated videos as scalable supervision for robot learning. However, for embodied manipulation, perceptual realism alone is not sufficient: generated interactions must also be physically consistent and executable by robotic agents. Existing benchmarks provide valuable assessments of visual quality and physical plausibility, but they do not systematically evaluate whether predicted behaviors can be translated into executable actions that complete manipulation tasks. We introduce RoboWM-Bench, a manipulation-centric benchmark for embodiment-grounded evaluation of video world models. RoboWM-Bench converts generated human-hand and robotic manipulation videos into embodied action sequences and validates them through execution in physically grounded…
