MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding

Zekun Li; Xianjun Yang; Kyuri Choi; Wanrong Zhu; Ryan Hsieh; HyeonJung Kim; Jin Hyuk Lim; Sungyoung Ji; Byungju Lee; Xifeng Yan; Linda Ruth Petzold; Stephen D. Wilson; Woosang Lim; William Yang Wang

arXiv:2407.04903 · cs.CL · February 21, 2025 · 1 citation

TL;DR

This paper introduces a multi-disciplinary dataset for scientific figure interpretation, compiled from peer-reviewed Nature Communications articles across 72 fields. Models fine-tuned on it outperform human experts on the multiple-choice task.

Contribution

The paper presents a large, diverse dataset of complex scientific figures from 72 fields and shows that models fine-tuned on it can outperform GPT-4o and human experts on the multiple-choice task.

Findings

01. Models fine-tuned on the dataset outperform GPT-4o and humans in multiple-choice tasks.

02. Pre-training on article-figure data improves performance in materials science.

03. The dataset covers complex visualizations requiring graduate-level expertise.

Abstract

Scientific figure interpretation is a crucial capability for AI-driven scientific assistants built on advanced Large Vision Language Models. However, current datasets and benchmarks primarily focus on simple charts or other relatively straightforward figures from limited science domains. To address this gap, we present a comprehensive dataset compiled from peer-reviewed Nature Communications articles covering 72 scientific fields, encompassing complex visualizations such as schematic diagrams, microscopic images, and experimental data which require graduate-level expertise to interpret. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation. Our analysis revealed significant task challenges and performance gaps among models. Beyond serving as a benchmark, this dataset serves as a valuable…

Code & Models

Repositories

leezekun/mmsci (PyTorch, official)

Datasets

MMSci/NatureCommsCorpus (dataset, 24 downloads)

Taxonomy

Topics: Topic Modeling

Methods: Focus