AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

Jonathan Bragg; Mike D'Arcy; Nishant Balepur; Dan Bareket; Bhavana Dalvi; Sergey Feldman; Dany Haddad; Jena D. Hwang; Peter Jansen; Varsha Kishore; Bodhisattwa Prasad Majumder; Aakanksha Naik; Sigal Rahamimov; Kyle Richardson; Amanpreet Singh; Harshit Surana; Aryeh Tiktinsky; Rosni Vasu; Guy Wiener; Chloe Anastasiades; Stefan Candra; Jason Dunkelberger; Dan Emery; Rob Evans; Malachi Hamada; Regan Huff; Rodney Kinney; Matt Latzke; Jaron Lochner; Ruben Lozano-Aguilera; Cecile Nguyen; Smita Rao; Amber Tanaka; Brooke Vlahos; Peter Clark; Doug Downey; Yoav Goldberg; Ashish Sabharwal; Daniel S. Weld

arXiv:2510.21652 · cs.AI · April 23, 2026

TL;DR

AstaBench is a benchmarking suite for evaluating AI agents on scientific research tasks. It addresses limitations of previous benchmarks by providing reproducible tools, standardized interfaces, and holistic measures.

Contribution

The paper introduces AstaBench, a new benchmarking suite with 2400+ problems, production-grade tools, and diverse agent classes for rigorous evaluation of scientific AI agents.

Findings

1. AI agents show progress but still struggle with complex scientific tasks.
2. Reproducible evaluation tools improve comparison accuracy.
3. Baseline agents help identify genuine advancements.

Abstract

AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose "deep research" systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they often (1) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (2) do not account for confounding variables such as model cost and tool access; (3) do not provide standardized interfaces for quick agent prototyping and evaluation; (4) fail to provide holistic, product-informed measures of real-world use cases such as science research; and (5) lack comprehensive baseline…
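Point (2) of the abstract names model cost as a confounding variable. One common way to account for it, sketched here as an illustration and not as AstaBench's own method, is to compare agents on a score-versus-cost Pareto frontier: an agent stays on the frontier only if no other agent is both at least as cheap and at least as accurate. The `Run` type, the `pareto_frontier` function, the agent names, and all numbers below are hypothetical and do not come from the paper.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    agent: str       # hypothetical agent name
    score: float     # benchmark score, higher is better
    cost_usd: float  # cost of the run, lower is better


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Return the runs that no other run beats on both cost and score."""
    frontier: List[Run] = []
    # Visit runs from cheapest to most expensive; on equal cost,
    # visit the higher-scoring run first.
    for run in sorted(runs, key=lambda r: (r.cost_usd, -r.score)):
        # A run is kept only if it scores strictly higher than every
        # cheaper (or equally cheap) run already on the frontier.
        if not frontier or run.score > frontier[-1].score:
            frontier.append(run)
    return frontier


# Made-up numbers, for illustration only.
runs = [
    Run("agent_a", score=0.40, cost_usd=1.00),
    Run("agent_b", score=0.55, cost_usd=5.00),
    Run("agent_c", score=0.50, cost_usd=8.00),  # costs more than agent_b, scores lower
    Run("agent_d", score=0.30, cost_usd=2.00),  # costs more than agent_a, scores lower
]

frontier = pareto_frontier(runs)
for run in frontier:
    print(f"{run.agent}: score={run.score:.2f}, cost=${run.cost_usd:.2f}")
```

Reporting the frontier instead of a single ranking by score keeps a cheap, moderately accurate agent visible next to an expensive, slightly better one, which is the kind of comparison a cost-blind leaderboard hides.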

Code & Models

Repositories

allenai/asta-bench (GitHub)

Videos

AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite (SlidesLive)