ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

Ziru Chen; Shijie Chen; Yuting Ning; Qianheng Zhang; Boshi Wang; Botao Yu; Yifei Li; Zeyi Liao; Chen Wei; Zitong Lu; Vishal Dey; Mingyi Xue; Frazier N. Baker; Benjamin Burns; Daniel Adu-Ampratwum; Xuhui Huang; Xia Ning; Song Gao; Yu Su; Huan Sun

arXiv:2410.05080 · cs.CL · April 1, 2025 · 6 cites

TL;DR

ScienceAgentBench is a benchmark for rigorously evaluating language agents on individual tasks in data-driven scientific discovery. Its results expose the limitations of current agents and point to where they need to improve.

Contribution

The paper introduces ScienceAgentBench, an expert-validated, multi-disciplinary benchmark of 102 tasks, together with evaluation metrics and strategies for assessing and improving scientific language agents.

Findings

01 The best agents solve only about 32% of the tasks independently.

02 Performance rises to 42% with more compute, at higher cost.

03 Current language agents remain significantly limited in scientific code generation.
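
To put the rounded rates above in concrete terms, the short sketch below converts them into approximate task counts, assuming the 102-task total stated under Contribution. The function name is hypothetical, and because the rates are rounded the counts are estimates rather than reported figures.

```python
# Convert the rounded success rates from the findings into approximate
# task counts. TOTAL_TASKS comes from the Contribution summary; the
# rates are the rounded values quoted above, so results are estimates.
TOTAL_TASKS = 102


def approx_tasks_solved(rate: float, total: int = TOTAL_TASKS) -> int:
    """Return the nearest whole number of tasks for a given success rate."""
    return round(rate * total)


if __name__ == "__main__":
    for label, rate in [
        ("best agent, working independently", 0.32),
        ("with more compute, at higher cost", 0.42),
    ]:
        solved = approx_tasks_solved(rate)
        print(f"{label}: roughly {solved} of {TOTAL_TASKS} tasks")
```

Under these assumptions, the gap between the two settings is on the order of ten tasks.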

Abstract

The advancements of large language models (LLMs) have piqued growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, which has sparked both excitement and skepticism about their true capabilities. In this work, we call for rigorous assessment of agents on individual tasks in a scientific workflow before making bold claims on end-to-end automation. To this end, we present ScienceAgentBench, a new benchmark for evaluating language agents for data-driven scientific discovery. To ensure the scientific authenticity and real-world relevance of our benchmark, we extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them. We unify the target output for every task to a self-contained Python program file and employ an array of evaluation metrics to examine the generated programs,…
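
Because every task's target output is a single self-contained Python program file, one basic check an evaluator could apply is whether a generated program runs to completion. The sketch below illustrates that idea only. It is not the benchmark's actual harness, and the names `runs_cleanly` and `execution_rate`, as well as the timeout value, are assumptions made for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def runs_cleanly(program_source: str, timeout_s: float = 30.0) -> bool:
    """Write the program to a temporary file and report whether it exits with code 0.

    Hypothetical helper: a real harness would also sandbox the process and
    compare the program's outputs against task-specific success criteria.
    """
    with tempfile.TemporaryDirectory() as workdir:
        path = Path(workdir) / "program.py"
        path.write_text(program_source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(path)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # A program that does not finish in time counts as a failure.
            return False
        return result.returncode == 0


def execution_rate(programs: list[str]) -> float:
    """Fraction of programs that run without error (0.0 for an empty list)."""
    if not programs:
        return 0.0
    return sum(runs_cleanly(p) for p in programs) / len(programs)


if __name__ == "__main__":
    samples = ["print('ok')", "raise ValueError('boom')"]
    print(f"execution rate: {execution_rate(samples):.2f}")
```

Running without error is only a necessary condition. Deciding whether a program actually solves its task needs further, task-specific checks on its outputs.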

Code & Models

Repositories

osu-nlp-group/scienceagentbench

Videos

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies