AIRA_2: Overcoming Bottlenecks in AI Research Agents

Karen Hambardzumyan; Nicolas Baldwin; Edan Toledo; Rishi Hazra; Michael Kuchnik; Bassel Al Omari; Thomas Simon Foster; Anton Protopopov; Jean-Christophe Gagnon-Audet; Ishita Mediratta; Kelvin Niu; Michael Shvartsman; Alisia Lupidi; Alexis Audran-Reiss; Parth Pathak; Tatiana Shavrina; Despoina Magka; Hela Momand; Derek Dunfield; Nicola Cancedda; Pontus Stenetorp; Carole-Jean Wu; Jakob Nicolaus Foerster; Yoram Bachrach; Martin Josifoski

arXiv:2603.26499·cs.AI·April 14, 2026

TL;DR

AIRA$_2$ targets three bottlenecks in AI research agents with an asynchronous multi-GPU worker pool, a Hidden Consistent Evaluation protocol, and ReAct agents in place of fixed, single-turn LLM operators, reaching a mean Percentile Rank of 81.5% at 24 hours on MLE-bench-30.

Contribution

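The paper presents AIRA$_2$, an AI research agent architecture that addresses three known bottlenecks (limited sample throughput, a generalization gap from validation-based selection, and the ceiling imposed by fixed, single-turn LLM operators) with one architectural choice per bottleneck.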

Findings

01

AIRA$_2$ achieves a mean Percentile Rank of 81.5% at 24 hours and 83.1% at 72 hours on MLE-bench-30.

02

AIRA$_2$ surpasses human state-of-the-art on 6 out of 20 tasks in AIRS-Bench.

03

Each architectural component of AIRA$_2$ is necessary for its performance gains.

Abstract

Existing research has identified three structural performance bottlenecks in AI research agents: (1) synchronous single-GPU execution constrains sample throughput, limiting the benefit of search; (2) a generalization gap where validation-based selection causes overfitting and performance to degrade over extended search horizons; and (3) the limited capability of fixed, single-turn LLM operators imposes a ceiling on search performance. We introduce AIRA$_2$, which addresses these bottlenecks through three architectural choices: an asynchronous multi-GPU worker pool that increases experiment throughput linearly; a Hidden Consistent Evaluation protocol that delivers a reliable evaluation signal; and ReAct agents that dynamically scope their actions and debug interactively. On MLE-bench-30, AIRA$_2^{\dagger}$ achieves a mean Percentile Rank of 81.5% at 24 hours and 83.1% at 72 hours,…
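To make the first architectural choice concrete, here is a minimal Python sketch of an asynchronous worker pool, written under stated assumptions and not taken from the paper. Each simulated GPU pulls the next candidate from a shared queue as soon as it is free, so a slow experiment on one device does not stall the others. The names `run_experiment`, `worker`, and `run_pool` are hypothetical, and the sleep and random score stand in for a real training and evaluation job.

```python
import asyncio
import random


async def run_experiment(candidate_id: int, gpu_id: int) -> dict:
    """Hypothetical stand-in for training and scoring one candidate on one GPU."""
    # Simulated variable runtime; a real worker would launch a training job here.
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return {"candidate": candidate_id, "gpu": gpu_id, "score": random.random()}


async def worker(gpu_id: int, queue: asyncio.Queue, results: list) -> None:
    """Pull candidates whenever this GPU is free; never wait on other GPUs."""
    while True:
        candidate_id = await queue.get()
        try:
            results.append(await run_experiment(candidate_id, gpu_id))
        finally:
            queue.task_done()


async def run_pool(num_gpus: int, num_candidates: int) -> list:
    """Run all candidates across num_gpus concurrent workers."""
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    for candidate_id in range(num_candidates):
        queue.put_nowait(candidate_id)

    workers = [
        asyncio.create_task(worker(gpu_id, queue, results))
        for gpu_id in range(num_gpus)
    ]
    await queue.join()  # returns once every queued candidate has been processed

    # The workers loop forever, so cancel them once the queue is drained.
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    finished = asyncio.run(run_pool(num_gpus=4, num_candidates=16))
    print(len(finished), "experiments finished")
```

In this toy setting, results arrive in completion order, not submission order, which is the property that distinguishes an asynchronous pool from a synchronous loop that waits for each experiment before starting the next.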
