VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

Jiarong Liang; Max Ku; Ka-Hei Hui; Ping Nie; Wenhu Chen

arXiv:2602.13294 · cs.CV · May 22, 2026

TL;DR

VisPhyWorld introduces an execution-based framework for evaluating physical reasoning in models by generating and inspecting executable simulator code from visual data, enabling more testable and falsifiable assessments.

Contribution

The paper presents VisPhyWorld, a novel framework that evaluates physical reasoning through code generation, and introduces VisPhyBench, a comprehensive benchmark for physical scene understanding.

Findings

01. State-of-the-art models excel at semantic scene understanding but struggle with physical parameter inference.

02. Before fallback, 97.7% of the pipeline's reconstructed videos are valid.

03. Models have difficulty simulating consistent physical dynamics.

Abstract

Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations. By producing runnable code, the inferred world representation is directly inspectable, editable, and falsifiable. This separates physical reasoning from rendering. Building on this framework, we introduce VisPhyBench, comprising 209 evaluation scenes derived from 108 physical templates and a systematic protocol that evaluates how well models reconstruct appearance and…
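To make the idea of an executable, falsifiable physical hypothesis concrete, here is a minimal sketch in Python. It is not the paper's simulator, prompt format, or metric: the `BallHypothesis` fields, the `simulate` integrator, and the `trajectory_error` score are all hypothetical names chosen for illustration, and the scene (a single ball bouncing on a floor) is an assumed toy case.

```python
from dataclasses import dataclass


@dataclass
class BallHypothesis:
    """Hypothetical parameters a model might infer from a video of a bouncing ball."""
    y0: float           # initial height above the floor (m)
    vy0: float          # initial vertical velocity (m/s)
    gravity: float      # downward acceleration (m/s^2)
    restitution: float  # fraction of speed kept after each bounce


def simulate(h: BallHypothesis, steps: int, dt: float = 0.01) -> list[float]:
    """Roll the hypothesis forward with semi-implicit Euler; return height per step."""
    y, vy = h.y0, h.vy0
    heights = []
    for _ in range(steps):
        vy -= h.gravity * dt
        y += vy * dt
        if y < 0.0:
            # Floor contact: clamp to the floor and reverse the damped velocity.
            y = 0.0
            vy = -vy * h.restitution
        heights.append(y)
    return heights


def trajectory_error(predicted: list[float], observed: list[float]) -> float:
    """Mean absolute height difference between two equal-length trajectories."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


if __name__ == "__main__":
    # Stand-in for heights tracked from a real video (assumed, generated here).
    truth = BallHypothesis(y0=1.0, vy0=0.0, gravity=9.81, restitution=0.8)
    observed = simulate(truth, steps=200)

    # A hypothesis with the wrong gravity is exposed by running it.
    wrong = BallHypothesis(y0=1.0, vy0=0.0, gravity=3.0, restitution=0.8)
    print("error (correct hypothesis):", trajectory_error(simulate(truth, 200), observed))
    print("error (wrong gravity):     ", trajectory_error(simulate(wrong, 200), observed))
```

Because the hypothesis is plain data fed to runnable code, each parameter can be inspected or edited on its own, and a wrong value shows up as a measurable mismatch with the observation rather than as a plausible-sounding answer.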

Code & Models

Repositories

TIGER-AI-Lab/VisPhyWorld (GitHub)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition