SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning

Kun Xiang; Heng Li; Terry Jingchen Zhang; Yinya Huang; Zirong Liu; Peixin Qu; Jixi He; Jiaqi Chen; Yu-Jie Yuan; Jianhua Han; Hang Xu; Hanhui Li; Mrinmaya Sachan; Xiaodan Liang

arXiv:2505.19099 · cs.AI · October 7, 2025

TL;DR

SeePhys is a benchmark that tests vision-based physics reasoning in large language models; it reveals significant challenges in visual understanding and in integrating that understanding with reasoning.

Contribution

The paper introduces SeePhys, a large-scale, diverse benchmark emphasizing vision-essential physics problems to evaluate and challenge current multimodal reasoning models.

Findings

01 Even the most advanced models (e.g., Gemini-2.5-pro and o4-mini) score below 60% accuracy.

02 Visual reasoning remains a significant challenge for LLMs.

03 Models tend to rely on textual cues rather than visual understanding.

Abstract

We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. In contrast to prior works where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that mandate visual information extraction for correct solutions. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models' visual understanding capabilities, particularly in: (i) establishing rigorous coupling between diagram interpretation and…

Code & Models

Repositories

seephys/seephys-project
PyTorch · Official

Datasets

SeePhys/SeePhys
dataset · 157 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Science Education and Pedagogy