QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models

Li Puyin; Tiange Xiang; Ella Mao; Shirley Wei; Xinye Chen; Adnan Masood; Li Fei-Fei; Ehsan Adeli

arXiv:2512.19526·cs.AI·December 23, 2025



TL;DR

QuantiPhy is a new benchmark that quantitatively assesses vision-language models' ability to reason about physical properties like size, velocity, and acceleration from videos, revealing gaps between plausibility and numerical accuracy.

Contribution

This paper introduces QuantiPhy, the first benchmark for quantitatively evaluating physical reasoning in vision-language models using video-text data with ground truth measurements.

Findings

01 State-of-the-art VLMs show a gap between qualitative plausibility and numerical correctness.

02 Models tend to rely on pre-trained world knowledge rather than visual inputs.

03 QuantiPhy provides a scalable, rigorous testbed for physical reasoning evaluation.

Abstract

Understanding the physical world is essential for generalist AI agents. However, it remains unclear whether state-of-the-art vision perception models (e.g., large VLMs) can reason about physical properties quantitatively. Existing evaluations are predominantly VQA-based and qualitative, offering limited insight into whether these models can infer the kinematic quantities of moving objects from video observations. To address this, we present QuantiPhy, the first benchmark designed to quantitatively measure a VLM's physical reasoning ability. Comprising more than 3.3K video-text instances with numerical ground truth, QuantiPhy evaluates a VLM's performance on estimating an object's size, velocity, and acceleration at a given timestamp, using one of these properties as an input prior. The benchmark standardizes prompts and scoring to assess numerical accuracy, enabling fair comparisons across…
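To make the task concrete, the sketch below illustrates how a known size can act as the input prior for recovering velocity and acceleration from pixel-space tracks. It is an illustration only, not the benchmark's pipeline: the function names, the central-difference estimators, the uniform frame spacing, and the relative-error score are all assumptions, since the abstract does not specify QuantiPhy's prompts or scoring rule.

```python
from typing import Sequence


def meters_per_pixel(known_size_m: float, size_px: float) -> float:
    """Scale factor from a size prior: real-world size divided by apparent size."""
    if size_px <= 0:
        raise ValueError("size_px must be positive")
    return known_size_m / size_px


def velocity_at(pos_px: Sequence[float], times_s: Sequence[float],
                scale: float, i: int) -> float:
    """Central-difference velocity (m/s) at frame i along one image axis."""
    dt = times_s[i + 1] - times_s[i - 1]
    return (pos_px[i + 1] - pos_px[i - 1]) * scale / dt


def acceleration_at(pos_px: Sequence[float], times_s: Sequence[float],
                    scale: float, i: int) -> float:
    """Second-order central difference (m/s^2); assumes uniform frame spacing."""
    dt = times_s[i + 1] - times_s[i]
    return (pos_px[i + 1] - 2 * pos_px[i] + pos_px[i - 1]) * scale / dt ** 2


def relative_error(predicted: float, ground_truth: float) -> float:
    """Hypothetical numerical score: |prediction - truth| / |truth|."""
    if ground_truth == 0:
        raise ValueError("ground_truth must be non-zero")
    return abs(predicted - ground_truth) / abs(ground_truth)


# Made-up example: a 0.22 m object spans 44 px and moves 40 px per 0.1 s frame.
scale = meters_per_pixel(0.22, 44.0)
positions_px = [100.0, 140.0, 180.0]
timestamps_s = [0.0, 0.1, 0.2]
v = velocity_at(positions_px, timestamps_s, scale, 1)
a = acceleration_at(positions_px, timestamps_s, scale, 1)
```

A score of this kind penalizes an answer that is qualitatively plausible but numerically off, which is the distinction the benchmark targets; the actual metric used by QuantiPhy may differ.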


Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis