AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench

Edan Toledo; Karen Hambardzumyan; Martin Josifoski; Rishi Hazra; Nicolas Baldwin; Alexis Audran-Reiss; Michael Kuchnik; Despoina Magka; Minqi Jiang; Alisia Maria Lupidi; Andrei Lupu; Roberta Raileanu; Kelvin Niu; Tatiana Shavrina; Jean-Christophe Gagnon-Audet; Michael Shvartsman; Shagun Sodhani; Alexander H. Miller; Abhishek Charnalia; Derek Dunfield; Carole-Jean Wu; Pontus Stenetorp; Nicola Cancedda; Jakob Nicolaus Foerster; Yoram Bachrach

arXiv:2507.02554 · cs.AI · November 5, 2025


TL;DR

This paper studies how the choice of search strategy and operator set affects the performance of AI research agents on MLE-bench, a benchmark in which agents compete in Kaggle competitions to solve real-world machine learning problems. The best pairing achieves a state-of-the-art result on MLE-bench lite.

Contribution

The paper systematically analyzes the interplay between search policies (Greedy, MCTS, Evolutionary) and operator sets, showing that this interplay is critical to agent performance on automated machine learning tasks.

Findings

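01. The best pairing of search strategy and operator set raised the rate of achieving a Kaggle medal on MLE-bench lite from 39.6% to 47.7%.

02. The interplay between search strategy and operator set is critical for high performance.

03. Considering search strategy, operator design, and evaluation jointly advances automated machine learning.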
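
The size of the improvement in the first finding can be stated either in percentage points or as a relative gain. The short calculation below uses only the two reported rates; the variable names are illustrative and not taken from the paper.

```python
# Reported medal success rates on MLE-bench lite (from the abstract).
baseline_rate = 39.6   # percent, previous state of the art
best_rate = 47.7       # percent, best search/operator pairing

# Absolute gain in percentage points.
absolute_gain = round(best_rate - baseline_rate, 1)

# Relative gain, as a percentage of the baseline rate.
relative_gain = round((best_rate - baseline_rate) / baseline_rate * 100, 1)

print(f"absolute gain: {absolute_gain} percentage points")
print(f"relative gain: {relative_gain}% over the baseline")
```

The absolute gain is 8.1 percentage points, which corresponds to a relative gain of roughly 20.5% over the baseline.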

Abstract

AI research agents are demonstrating great potential to accelerate scientific progress by automating the design, implementation, and training of machine learning models. We focus on methods for improving agents' performance on MLE-bench, a challenging benchmark where agents compete in Kaggle competitions to solve real-world machine learning problems. We formalize AI research agents as search policies that navigate a space of candidate solutions, iteratively modifying them using operators. By designing and systematically varying different operator sets and search policies (Greedy, MCTS, Evolutionary), we show that their interplay is critical for achieving high performance. Our best pairing of search strategy and operator set achieves a state-of-the-art result on MLE-bench lite, increasing the success rate of achieving a Kaggle medal from 39.6% to 47.7%. Our investigation underscores the…
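
The formalization in the abstract (a search policy that navigates candidate solutions and modifies them with operators) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy `evaluate` function stands in for a validation score, the operators `small_step` and `large_step` are hypothetical, and the greedy policy shown is a generic one rather than the paper's implementation.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

Solution = List[float]
Operator = Callable[[Solution, random.Random], Solution]


@dataclass
class Node:
    """One candidate solution and its score in the search graph."""
    solution: Solution
    score: float


def evaluate(solution: Solution) -> float:
    """Toy stand-in for a validation score (higher is better)."""
    target = [1.0, -2.0, 0.5]
    return -sum((s - t) ** 2 for s, t in zip(solution, target))


def small_step(solution: Solution, rng: random.Random) -> Solution:
    """Hypothetical operator: nudge a single coordinate."""
    out = list(solution)
    i = rng.randrange(len(out))
    out[i] += rng.gauss(0.0, 0.1)
    return out


def large_step(solution: Solution, rng: random.Random) -> Solution:
    """Hypothetical operator: perturb every coordinate."""
    return [s + rng.gauss(0.0, 1.0) for s in solution]


def greedy_search(operators: List[Operator], steps: int, seed: int = 0) -> Node:
    """Greedy policy: always expand the best-scoring node found so far."""
    rng = random.Random(seed)
    start = [0.0, 0.0, 0.0]
    nodes = [Node(start, evaluate(start))]
    for _ in range(steps):
        parent = max(nodes, key=lambda n: n.score)   # search policy
        op = rng.choice(operators)                   # operator selection
        child = op(parent.solution, rng)             # modify the candidate
        nodes.append(Node(child, evaluate(child)))   # score and store it
    return max(nodes, key=lambda n: n.score)


best = greedy_search([small_step, large_step], steps=200)
print(f"best score: {best.score:.4f}")
```

In this sketch the search policy is the single line that selects `parent`, and the operator set is the list passed to `greedy_search`. Varying either one independently mirrors, in a very simplified way, the kind of policy/operator pairing the paper compares.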


Code & Models

Repositories

facebookresearch/aira-dojo (Official)


Taxonomy

Methods: Sparse Evolutionary Training · Focus