MSEarth: A Multimodal Benchmark for Earth Science Phenomenon Discovery with MLLMs
Xiangyu Zhao, Wanghan Xu, Bo Liu, Yuhao Zhou, Fenghua Ling, Ben Fei, Xiaoyu Yue, Lei Bai, Wenlong Zhang, Xiao-Ming Wu

TL;DR
MSEarth is a multimodal benchmark dataset for evaluating the scientific reasoning of multimodal large language models (MLLMs) in Earth science. It covers five spheres, contains over 289,000 figures, and supports captioning, multiple-choice, and open-ended reasoning tasks.
Contribution
The paper introduces MSEarth, a multimodal dataset and benchmark for Earth science curated from high-quality, open-access publications, for evaluating the scientific reasoning of MLLMs.
Findings
MSEarth includes over 289,000 figures, each with a refined caption enriched by contextual discussion and reasoning from the original paper.
The benchmark supports figure captioning, multiple-choice, and open-ended reasoning tasks.
It provides a scalable resource for developing earth science MLLMs.
Abstract
The rapid advancement of multimodal large language models (MLLMs) offers new opportunities for complex scientific challenges, yet their application in Earth science, especially at the graduate level, remains underexplored due to a lack of benchmarks reflecting the depth and complexity of geoscientific reasoning. Existing datasets often rely on synthetic data or simple figure-caption pairs, failing to capture the nuanced reasoning required for real-world applications. To address this, we introduce MSEarth, a multimodal scientific dataset and benchmark curated from high-quality, open-access publications. Covering the five major spheres of Earth science (atmosphere, cryosphere, hydrosphere, lithosphere, and biosphere), MSEarth features over 289K figures with refined captions enriched by contextual discussions and reasoning from the original papers. The benchmark supports tasks such as…
