AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents

Alisia Lupidi; Bhavul Gauri; Thomas Simon Foster; Bassel Al Omari; Despoina Magka; Alberto Pepe; Alexis Audran-Reiss; Muna Aghamelu; Nicolas Baldwin; Lucia Cipolina-Kun; Jean-Christophe Gagnon-Audet; Chee Hau Leow; Sandra Lefdal; Hossam Mossalam; Abhinav Moudgil; Saba Nazir; Emanuel Tewolde; Isabel Urrego; Jordi Armengol Estape; Amar Budhiraja; Gaurav Chaurasia; Abhishek Charnalia; Derek Dunfield; Karen Hambardzumyan; Daniel Izcovich; Martin Josifoski; Ishita Mediratta; Kelvin Niu; Parth Pathak; Michael Shvartsman; Edan Toledo; Anton Protopopov; Roberta Raileanu; Alexander Miller; Tatiana Shavrina; Jakob Foerster; Yoram Bachrach

arXiv:2602.06855 · cs.AI · February 17, 2026

TL;DR

AIRS-Bench is a suite of 20 tasks, sourced from state-of-the-art machine learning papers, for evaluating AI agents across the full research lifecycle. Baseline agents exceed human SOTA on four tasks and fall short on the other sixteen.

Contribution

Introduces AIRS-Bench, a versatile, open-source benchmark for assessing AI agents in scientific research tasks across multiple domains and research stages.

Findings

01 Agents exceed human SOTA on 4 of the 20 tasks

02 Agents do not reach theoretical performance limits

03 Agents fall short of human SOTA on the other 16 tasks, leaving significant room for improvement

Abstract

LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS-Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from state-of-the-art machine learning papers. These tasks span diverse domains, including language modeling, mathematics, bioinformatics, and time series forecasting. AIRS-Bench tasks assess agentic capabilities over the full research lifecycle -- including idea generation, experiment analysis and iterative refinement -- without providing baseline code. The AIRS-Bench task format is versatile, enabling easy integration of new tasks and rigorous comparison across different agentic frameworks. We establish baselines using frontier models paired with both sequential and parallel scaffolds. Our results show that agents exceed human SOTA in four tasks but fail to match it in sixteen others. Even when…

Code & Models

Datasets

facebook/airs-bench (dataset, 229 downloads)

Taxonomy

Topics: Machine Learning in Materials Science · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)