From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models

Tiancheng Han; Yunfei Gao; Yong Li; Wuzhou Yu; Qiaosheng Zhang; Wenqi Shao

arXiv:2508.10770 · cs.CV · August 15, 2025

TL;DR

This paper diagnoses the shortcomings of current vision language models (VLMs) in spatio-physical reasoning, then applies supervised fine-tuning and reinforcement learning to improve that capability. Generalization to new physics scenarios remains limited.

Contribution

The paper provides a comprehensive diagnostic analysis of VLMs' spatio-physical reasoning and applies supervised fine-tuning followed by rule-based reinforcement learning to improve their performance.

Findings

01 Current models perform inadequately on spatio-physical reasoning.

02 Fine-tuning and reinforcement learning significantly improve reasoning capabilities.

03 Generalization to new physics scenarios remains limited.

Abstract

Spatio-physical reasoning, a foundational capability for understanding the real physical world, is a critical step towards building robust world models. While recent vision language models (VLMs) have shown remarkable progress in specialized domains like multimodal mathematics and pure spatial understanding, their capability for spatio-physical reasoning remains largely unexplored. This paper provides a comprehensive diagnostic analysis of mainstream VLMs, revealing that current models perform inadequately on this crucial task. Further detailed analysis shows that this underperformance is largely attributable to biases caused by human-like priors and a lack of deep reasoning. To address these challenges, we apply supervised fine-tuning followed by rule-based reinforcement learning to Qwen2.5-VL-7B, resulting in significant improvements in spatio-physical reasoning capabilities and…
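
To make "rule-based reinforcement learning" concrete, here is a minimal sketch of the kind of reward such a setup might use: a deterministic function that scores a model completion by checking its format and comparing its final answer to a reference, with no learned reward model. The abstract does not specify the reward, so everything here is an assumption for illustration: the `<think>`/`<answer>` tag format, the function name `rule_based_reward`, and the `format_weight` parameter are hypothetical and not taken from the paper.

```python
import re

# Assumed (hypothetical) output format: reasoning inside <think>...</think>,
# followed by the final choice inside <answer>...</answer>.
_PATTERN = re.compile(
    r"^\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$",
    re.DOTALL,
)


def rule_based_reward(completion: str, gold: str, format_weight: float = 0.2) -> float:
    """Score a completion with fixed rules only (illustrative, not the paper's reward).

    Returns 0.0 if the completion does not follow the assumed tag format,
    format_weight if the format is valid but the answer is wrong,
    and 1.0 if the format is valid and the answer matches the reference.
    """
    match = _PATTERN.match(completion)
    if match is None:
        return 0.0  # malformed output earns no reward

    predicted = match.group(2).strip().upper()
    correct = predicted == gold.strip().upper()
    return format_weight + (1.0 - format_weight) * float(correct)
```

A reward of this shape is cheap to compute and hard to game with fluent but wrong text, which is one common motivation for preferring rules over a learned reward model when answers can be checked exactly, as with multiple-choice questions.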
