MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models
Jiacheng Ruan, Dan Jiang, Xian Gao, Ting Liu, Yuzhuo Fu, Yangyang Kang

TL;DR
This paper introduces MME-SCI, a comprehensive benchmark of 1,019 high-quality scientific question-answer pairs spanning four subjects (mathematics, physics, chemistry, and biology), multiple languages, and three evaluation modes, designed to better assess the reasoning and knowledge capabilities of multimodal large language models (MLLMs).
Contribution
The paper presents MME-SCI, a challenging multilingual and multimodal benchmark with fine-grained annotations of scientific knowledge points, addressing gaps in existing scientific evaluation benchmarks for MLLMs.
Findings
Existing models perform poorly on MME-SCI, indicating high difficulty.
Current models show weaknesses in multilingual settings and in specific subject domains.
The benchmark reveals significant room for improvement in reasoning and knowledge coverage.
Abstract
Recently, multimodal large language models (MLLMs) have achieved significant advancements across various domains, and corresponding evaluation benchmarks have been continuously refined and improved. In this process, benchmarks in the scientific domain have played an important role in assessing the reasoning capabilities of MLLMs. However, existing benchmarks still face three key challenges: 1) Insufficient evaluation of models' reasoning abilities in multilingual scenarios; 2) Inadequate assessment of MLLMs' comprehensive modality coverage; 3) Lack of fine-grained annotation of scientific knowledge points. To address these gaps, we propose MME-SCI, a comprehensive and challenging benchmark. We carefully collected 1,019 high-quality question-answer pairs, which involve 3 distinct evaluation modes. These pairs cover four subjects, namely mathematics, physics, chemistry, and biology, and…
