WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models

Rishi Upadhyay; Howard Zhang; Jim Solomon; Ayush Agrawal; Pranay Boreddy; Shruti Satya Narayana; Yunhao Ba; Alex Wong; Celso M de Melo; Achuta Kadambi

arXiv:2601.21282 · cs.CV · January 30, 2026

TL;DR

WorldBench is a new video benchmark designed to evaluate the understanding of individual physical concepts in world models, revealing specific weaknesses and enabling more reliable assessment of physical reasoning in generative models.

Contribution

WorldBench introduces concept-specific, disentangled evaluation of physical understanding, addressing the limited diagnostic power of existing physics benchmarks, which entangle multiple physical laws and concepts in a single test.

Findings

01. State-of-the-art models show specific failures in physical reasoning.

02. Models lack physical consistency in generating real-world interactions.

03. WorldBench enables nuanced evaluation of physical concept understanding.

Abstract

Recent advances in generative foundational models, often termed "world models," have propelled interest in applying them to critical tasks like robotic planning and autonomous system training. For reliable deployment, these models must exhibit high physical fidelity, accurately simulating real-world dynamics. Existing physics-based video benchmarks, however, suffer from entanglement, where a single test simultaneously evaluates multiple physical laws and concepts, fundamentally limiting their diagnostic capability. We introduce WorldBench, a novel video-based benchmark specifically designed for concept-specific, disentangled evaluation, allowing us to rigorously isolate and assess understanding of a single physical concept or law at a time. To make WorldBench comprehensive, we design benchmarks at two different levels: 1) an evaluation of intuitive physical understanding with concepts…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning · Human Pose and Action Recognition