MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models

Jiacheng Ruan; Dan Jiang; Xian Gao; Ting Liu; Yuzhuo Fu; Yangyang Kang

arXiv:2508.13938·cs.CL·August 20, 2025

TL;DR

This paper introduces MME-SCI, a benchmark of 1,019 high-quality scientific question-answer pairs spanning four subjects (mathematics, physics, chemistry, and biology), multiple languages, and three evaluation modes, designed to better assess the reasoning and knowledge of multimodal large language models (MLLMs).

Contribution

The paper presents MME-SCI, a challenging multilingual and multimodal benchmark with fine-grained annotations of scientific knowledge points, addressing gaps in existing scientific evaluation benchmarks for MLLMs.

Findings

01 Existing models perform poorly on MME-SCI, indicating that the benchmark is highly difficult.

02 Current models show multilingual and domain-specific weaknesses.

03 The benchmark reveals significant room for improvement in reasoning and knowledge coverage.

Abstract

Recently, multimodal large language models (MLLMs) have achieved significant advancements across various domains, and corresponding evaluation benchmarks have been continuously refined and improved. In this process, benchmarks in the scientific domain have played an important role in assessing the reasoning capabilities of MLLMs. However, existing benchmarks still face three key challenges: 1) Insufficient evaluation of models' reasoning abilities in multilingual scenarios; 2) Inadequate assessment of MLLMs' comprehensive modality coverage; 3) Lack of fine-grained annotation of scientific knowledge points. To address these gaps, we propose MME-SCI, a comprehensive and challenging benchmark. We carefully collected 1,019 high-quality question-answer pairs, which involve 3 distinct evaluation modes. These pairs cover four subjects, namely mathematics, physics, chemistry, and biology, and…

Code & Models

Datasets

JCruan/MME-SCI
dataset · 551 downloads

Videos

MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models