TL;DR
StepSTEM is a new benchmark designed to evaluate fine-grained multimodal reasoning in STEM tasks, revealing current models' reliance on textual reasoning and highlighting the need for improved cross-modal capabilities.
Contribution
We introduce StepSTEM, a rigorously curated multimodal STEM benchmark with a step-level evaluation framework for assessing the reasoning processes of multimodal large language models.
Findings
Current models rely heavily on textual reasoning in STEM tasks.
Even advanced models achieve only around 38% accuracy on StepSTEM.
There is significant room for improvement in genuine cross-modal reasoning.
Abstract
Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved…
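To make the contrast between final-answer accuracy and step-level evaluation concrete, here is a minimal sketch. It is not the StepSTEM scoring procedure, which the truncated abstract does not specify. It assumes each model attempt has already been annotated with one boolean per reference step saying whether that step was matched. The names `Attempt`, `final_answer_accuracy`, and `step_level_score`, and the toy data, are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attempt:
    """One model attempt on one problem (hypothetical structure)."""
    final_correct: bool        # did the final answer match the reference?
    step_matches: List[bool]   # one flag per reference reasoning step


def final_answer_accuracy(attempts: List[Attempt]) -> float:
    """Fraction of problems whose final answer is correct."""
    return sum(a.final_correct for a in attempts) / len(attempts)


def step_level_score(attempts: List[Attempt]) -> float:
    """Mean, over problems, of the fraction of reference steps matched."""
    per_problem = [sum(a.step_matches) / len(a.step_matches) for a in attempts]
    return sum(per_problem) / len(per_problem)


# Toy data (invented for illustration, not benchmark results).
attempts = [
    Attempt(True, [True, True, True, True]),    # correct answer, sound reasoning
    Attempt(True, [True, False, False, True]),  # correct answer, half the steps missing
    Attempt(False, [True, True, True, False]),  # wrong answer, mostly sound reasoning
]

print(f"final-answer accuracy: {final_answer_accuracy(attempts):.3f}")
print(f"step-level score:      {step_level_score(attempts):.3f}")
```

In this toy data the second attempt counts as fully correct under final-answer accuracy even though half of its reference steps are unmatched, which is the kind of gap a step-level metric is meant to expose.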
