EarthSE: A Benchmark for Evaluating Earth Scientific Exploration Capability of LLMs

Wanghan Xu; Xiangyu Zhao; Yuhao Zhou; Xiaoyu Yue; Ben Fei; Fenghua Ling; Wenlong Zhang; Lei Bai

arXiv:2505.17139 · cs.CL · June 2, 2025

TL;DR

This paper introduces EarthSE, a comprehensive benchmark with datasets and metrics to evaluate large language models' ability to perform scientific exploration in Earth sciences, covering fundamental to advanced tasks.

Contribution

It presents a new holistic benchmark with diverse datasets and evaluation metrics specifically designed for assessing LLMs' Earth science exploration capabilities.

Findings

01. Leading LLMs show significant limitations in Earth science exploration tasks.

02. The benchmark reveals substantial room for improvement in LLMs' scientific reasoning.

03. EarthSE provides a standardized platform for future research in scientific exploration evaluation.

Abstract

Advances in Large Language Models (LLMs) are driving interest in scientific applications, which calls for specialized benchmarks in domains such as Earth science. Existing benchmarks either take a general-science focus that lacks Earth science specificity or cover isolated subdomains without holistic evaluation. Furthermore, current benchmarks typically neglect the assessment of LLMs' capabilities in open-ended scientific exploration. In this paper, we present a comprehensive and professional benchmark for the Earth sciences, designed to evaluate the capabilities of LLMs in scientific exploration within this domain, spanning from fundamental to advanced levels. Leveraging a corpus of 100,000 research papers, we first construct two Question Answering (QA) datasets: Earth-Iron, which offers extensive question coverage for broad assessment, and Earth-Silver, which features a higher level of difficulty…

Taxonomy

Topics: Scientific Computing and Data Management · Distributed and Parallel Computing Systems · Research Data Management Practices

Methods: Focus