PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models

Chak-Wing Mak; Guanyu Zhu; Boyi Zhang; Hongji Li; Xiaowei Chi; Kevin Zhang; Yichen Wu; Yangfan He; Chun-Kai Fan; Wentao Lu; Kuangzhi Ge; Xinyu Fang; Hongyang He; Kuan Lu; Tianxiang Xu; Li Zhang; Yongxin Ni; Youhua Li; Shanghang Zhang

arXiv:2601.16007·cs.CV·January 23, 2026

Open Access

TL;DR

PhysicsMind introduces a comprehensive benchmark for evaluating foundational multimodal models' understanding of core physical principles through reasoning and generation tasks in real and simulated environments.

Contribution

It provides the first unified benchmark assessing physical law reasoning and generation in multimodal models, covering both real and simulation environments with canonical physics principles.

Findings

01. Models often rely on appearance heuristics rather than physics.

02. Current models frequently violate basic mechanics principles.

03. Scaling and training are insufficient for robust physical understanding.

Abstract

Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying physics remains underexplored. Existing benchmarks that attempt to measure it either rely on synthetic Visual Question Answering (VQA) templates or focus on perceptual video quality, which is tangential to how well a video abides by physical laws. To address this fragmentation, we introduce PhysicsMind, a unified benchmark with both real and simulation environments that evaluates law-consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton's First Law. PhysicsMind comprises two main tasks: i) VQA tasks, testing whether models can reason and determine physical quantities and values from images or short videos, and ii) Video…
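To make two of the named principles concrete, the following is a minimal Python sketch of the textbook quantities behind Center of Mass and Lever Equilibrium for point masses on a one-dimensional beam. It is not taken from PhysicsMind; the function names (center_of_mass, net_torque, lever_state), the sign convention, and the tolerance are assumptions made for illustration only.

```python
from typing import Sequence

G = 9.81  # gravitational acceleration in m/s^2 (assumed constant)


def center_of_mass(masses: Sequence[float], positions: Sequence[float]) -> float:
    """Mass-weighted mean position of point masses along a 1-D beam."""
    total = sum(masses)
    if total <= 0:
        raise ValueError("total mass must be positive")
    return sum(m * x for m, x in zip(masses, positions)) / total


def net_torque(masses: Sequence[float], positions: Sequence[float],
               pivot: float = 0.0) -> float:
    """Sum of weight times signed lever arm about the pivot.

    Hypothetical sign convention: a positive result means the side
    with x > pivot is pushed down.
    """
    return sum(m * G * (x - pivot) for m, x in zip(masses, positions))


def lever_state(masses: Sequence[float], positions: Sequence[float],
                pivot: float = 0.0, tol: float = 1e-9) -> str:
    """Classify the lever as balanced or tipping, within a tolerance."""
    tau = net_torque(masses, positions, pivot)
    if abs(tau) <= tol:
        return "balanced"
    return "tips right" if tau > 0 else "tips left"


if __name__ == "__main__":
    # 2 kg at 1 m left of the pivot balances 1 kg at 2 m right of it.
    print(lever_state([2.0, 1.0], [-1.0, 2.0]))
    print(center_of_mass([2.0, 1.0], [-1.0, 2.0]))
```

In this sketch the lever is balanced exactly when the center of mass of the loads sits over the pivot, which is the relationship a law-consistent answer to a lever question would have to respect.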

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis