Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

Yuhao Zhou; Yiheng Wang; Xuming He; Ao Shen; Ruoyao Xiao; Zhiwei Li; Qiantai Feng; Zijie Guo; Yuejin Yang; Hao Wu; Wenxuan Huang; Jiaqi Wei; Dan Si; Xiuqi Yao; Jia Bu; Haiwen Huang; Manning Wang; Tianfan Fu; Shixiang Tang; Ben Fei; Dongzhan Zhou; Fenghua Ling; Yan Lu; Siqi Sun; Chenhui Li; Guanjie Zheng; Jiancheng Lv; Wenlong Zhang; Lei Bai

arXiv:2506.10521 · cs.AI · November 17, 2025

TL;DR

The paper introduces the Scientists' First Exam (SFE), a comprehensive benchmark to evaluate the perception, understanding, and reasoning abilities of scientific Multimodal Large Language Models (MLLMs), revealing current models' limitations.

Contribution

It presents the SFE benchmark with 830 expert-verified VQA pairs across multiple scientific disciplines, addressing the gap in assessing MLLMs' scientific perception and reasoning capabilities.

Findings

01. Current models achieve only around 30% accuracy on SFE.

02. SFE covers 66 multimodal tasks across five scientific disciplines.

03. The benchmark highlights significant room for improvement in scientific MLLMs.

Abstract

Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists' First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, and scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question…

Code & Models

Datasets

InternScience/SFE
dataset · 1.3k downloads

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Innovative Teaching and Learning Methods · Educational Technology and Assessment
