Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks

Jing Jin; Hao Liu; Yan Bai; Yihang Lou; Zhenke Wang; Tianrun Yuan; Juntong Chen; Yongkang Zhu; Fanhu Zeng; Xuanyu Zhu; Tao Feng; Yige Xu

arXiv:2604.19697·cs.CV·May 11, 2026

TL;DR

StepSTEM is a new benchmark designed to evaluate fine-grained multimodal reasoning in STEM tasks, revealing current models' reliance on textual reasoning and highlighting the need for improved cross-modal capabilities.

Contribution

We introduce StepSTEM, a rigorous multimodal STEM benchmark with a step-level evaluation framework for assessing the reasoning processes of multimodal large language models.

Findings

01. Current models rely heavily on textual reasoning in STEM tasks.

02. Even advanced models achieve only around 38% accuracy on StepSTEM.

03. There is significant room for improvement in genuine cross-modal reasoning.

Abstract

Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved…
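
To make the distinction between final-answer accuracy and step-level evaluation concrete, here is a minimal sketch. It is not the paper's framework: the `Solution` type, the `normalize` matcher (plain string equality after lowercasing), and the `step_recall` score are all hypothetical names and choices made for illustration, and the example problem is invented.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    steps: list[str]  # intermediate reasoning steps, in order
    answer: str       # final answer


def normalize(text: str) -> str:
    # Hypothetical matcher: case- and whitespace-insensitive string equality.
    return " ".join(text.lower().split())


def final_answer_correct(pred: Solution, ref: Solution) -> bool:
    # Final-answer accuracy looks only at the last line of the solution.
    return normalize(pred.answer) == normalize(ref.answer)


def step_recall(pred: Solution, ref: Solution) -> float:
    # Fraction of reference steps that also appear among the predicted steps.
    if not ref.steps:
        return 0.0
    predicted = {normalize(s) for s in pred.steps}
    matched = sum(1 for s in ref.steps if normalize(s) in predicted)
    return matched / len(ref.steps)


# Invented example: the prediction skips the step that reads the figure.
ref = Solution(
    steps=["read acceleration from graph: 2 m/s^2", "apply F = m a", "F = 6 N"],
    answer="6 N",
)
pred = Solution(steps=["apply F = m a", "F = 6 N"], answer="6 N")

print(final_answer_correct(pred, ref))  # True
print(round(step_recall(pred, ref), 3))  # 0.667
```

In this toy case the prediction is scored as fully correct on the final answer even though it omits the visual step, which is the kind of gap a process-level score is meant to expose. A real evaluator would need a far more tolerant step matcher than string equality.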

Code & Models

Repositories

lll-hhh/STEPSTEM (GitHub)
