PhysUniBench: A Multi-Modal Physics Reasoning Benchmark at Undergraduate Level
Lintao Wang, Encheng Su, Jiaqi Liu, Pengze Li, Jiabei Xiao, Wenlong Zhang, Xinnan Dai, Xi Chen, Yuan Meng, Lei Bai, Wanli Ouyang, Shixiang Tang, Aoran Wang, Xinzhu Ma

TL;DR
PhysUniBench is a comprehensive multimodal benchmark designed to evaluate and enhance AI models' physics reasoning at the undergraduate level, covering diverse sub-disciplines and problem types with visual diagrams.
Contribution
This paper introduces PhysUniBench, a large-scale, rigorously curated physics reasoning benchmark for multimodal large language models, filling a gap in standardized undergraduate physics evaluation.
Findings
Current models struggle with physics reasoning, achieving only 51.6% accuracy.
Models find multi-step problems and diagram interpretation particularly challenging.
PhysUniBench provides a new standard for assessing AI's physics understanding.
Abstract
Physics problem-solving is a challenging domain for AI models, requiring integration of conceptual understanding, mathematical reasoning, and interpretation of physical diagrams. Existing evaluations fail to capture the full breadth and complexity of undergraduate physics, whereas this level provides a rigorous yet standardized testbed for pedagogical assessment of multi-step physical reasoning. To this end, we present PhysUniBench, a large-scale multimodal benchmark designed to evaluate and improve the reasoning capabilities of multimodal large language models (MLLMs) specifically on undergraduate-level physics problems. PhysUniBench consists of 3,304 physics questions spanning 8 major sub-disciplines of physics, each accompanied by one visual diagram. The benchmark includes both open-ended and multiple-choice questions, systematically curated and difficulty-rated through an iterative…
Peer Reviews
Decision: Submitted to ICLR 2026
- First comprehensive multimodal benchmark for undergraduate-level physics reasoning.
- Rigorous curation process with difficulty calibration and quality control.
- Extensive evaluation across sub-disciplines provides clear diagnostic insights.
- Well-written and potentially impactful for advancing AI-for-Science.
### 1. Lack of actionable guidance for model improvement
While the benchmark offers valuable insights into the limitations of current MLLMs, the paper does not sufficiently explore how architectural or training modifications (*e.g.*, physics-informed modules, structured reasoning layers, or symbolic integration) could help enhance physical perception and reasoning.
### 2. Limited analysis of reasoning failures
The error analysis mainly reports accuracy drops across sub-disciplines and difficul
- The benchmark targets undergraduate physics with calibrated difficulty and multilingual EN/ZH support.
- The paper uses model-in-the-loop curation to remove trivially solvable items and to stratify difficulty; this is a thoughtful twist on dataset construction.
- Dataset statistics and coverage across different subfields are clearly presented, with balanced difficulty bins.
- The caption ablation is a smart diagnostic revealing a likely bottleneck in visual/diagram understanding for physics.
- The results table
- Difficulty calibration (Qwen2.5-VL rollouts) and LLM judging with GPT-4o may inject model-specific biases; it’s unclear how sensitive results are to the choice of judge/rollout model or to prompt templates. A cross-judge analysis (or human spot-checks) would strengthen validity.
- Potential data contamination: It is not clear where exactly the datasets are from. The paper mentioned that problems came from textbooks/exams/competitions, but many may already appear online. The paper does not quant
(1) Significant and Well-Defined Research Gap: This paper successfully addresses a significant gap in existing physics benchmarks: the lack of a fully multimodal, undergraduate-level benchmark. It strikes a good balance between K-12/Olympiad-level and text-based university-level benchmarks, providing a suitable platform for assessing the comprehensive reasoning skills crucial for future scientists and engineers.
(2) Rigorous Dataset Construction Process: The dataset construction process descr
(1) Potential Bias in Difficulty Stratification: The difficulty stratification of the entire benchmark relies entirely on the performance of a single model (Qwen2.5-VL-72B). This can lead to difficulty ratings that are biased by that model. A problem that is difficult for the Qwen model may not be difficult for a model with a different architecture or training data (e.g., the GPT series), and vice versa. This may affect the generalizability of the difficulty stratification.
(2) Contradiction i
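The reviews above refer to difficulty stratification based on rollouts from a single model. As a rough illustration of that idea only, the sketch below assigns difficulty levels from per-question pass rates. The function names, the choice of five levels, the equal-width binning rule, and the rule of dropping questions solved in every rollout are all assumptions made for this example, not details taken from the paper.

```python
def pass_rate(rollouts):
    """Fraction of rollouts in which the model answered correctly."""
    return sum(rollouts) / len(rollouts)


def stratify(items, n_levels=5):
    """Map question id -> difficulty level (1 = easiest, n_levels = hardest).

    `items` maps a question id to a list of booleans, one per rollout.
    Questions solved in every rollout are treated as trivially solvable
    and dropped (an assumed filtering rule for this sketch).
    """
    levels = {}
    for qid, rollouts in items.items():
        rate = pass_rate(rollouts)
        if rate == 1.0:
            continue  # assumed filter: always solved, so removed
        # Lower pass rate -> higher level, using equal-width bins (assumed).
        levels[qid] = min(n_levels, int((1.0 - rate) * n_levels) + 1)
    return levels


# Hypothetical rollout outcomes for four questions, four rollouts each.
rollouts = {
    "q1": [True, True, True, True],
    "q2": [True, True, True, False],
    "q3": [True, False, False, False],
    "q4": [False, False, False, False],
}
levels = stratify(rollouts)
print(levels)
```

Because the levels depend only on one model's pass rates, rerunning the same procedure with a different rollout model could reorder the questions, which is the bias the reviewers point to.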
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Science Education and Pedagogy · Multimodal Machine Learning Applications
