FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

Qiran Zou; Hou Hei Lam; Wenhao Zhao; Tingting Chen; Yiming Tang; Samson Yu; Yingtao Zhu; Srinivas Anumasa; Zufeng Zhang; Tianyi Zhang; Chang Liu; Zhengyao Jiang; Anirudh Goyal; Dianbo Liu

arXiv:2605.17373 · cs.LG · May 19, 2026

TL;DR

FML-Bench is a new benchmark for evaluating AI research agent strategies across diverse ML tasks, emphasizing process-level metrics to understand exploration behaviors and strategy effectiveness.

Contribution

The paper introduces FML-Bench, a benchmark of 18 ML research tasks across 10 domains with 12 process-level behavioral metrics. It separates agent strategy from execution infrastructure, so that performance differences can be attributed to strategy rather than tooling.

Findings

01. Simple greedy hill-climbing performs nearly as well as complex tree-search agents.

02. Adaptive strategies that switch exploration modes outperform static agents.

03. Early convergence and focused exploration correlate with higher final performance.

Abstract

AI research agents accelerate ML research by automating hypothesis generation, experimentation, and empirical refinement. Existing agent strategies range from greedy hill-climbing to tree search and evolutionary optimization, yet which strategy choices drive performance remains unclear. Answering this question requires a benchmark that separates agent strategy (e.g., search topology) from execution infrastructure (e.g., code editor), so that performance differences are attributable to strategy rather than infrastructure, and that provides process-level metrics beyond final scores to analyze exploration behaviors. Existing benchmarks offer limited support. We propose FML-Bench, a benchmark of 18 fundamental ML research tasks across 10 domains that separates agent strategy from execution infrastructure and defines 12 process-level behavioral metrics. Evaluating six representative agents,…
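
To make the strategy/infrastructure split described in the abstract concrete, here is a minimal sketch of greedy hill-climbing written so that the search logic never touches the execution layer directly. It is an illustration only, not the paper's implementation: `greedy_hill_climb`, `toy_propose`, and `toy_evaluate` are hypothetical names, and the toy objective stands in for "run an experiment and read back a score". The returned `history` is the kind of per-step trace on which process-level metrics could be computed.

```python
import random
from typing import Callable, List, Tuple


def greedy_hill_climb(
    initial: float,
    propose: Callable[[float], float],
    evaluate: Callable[[float], float],
    steps: int,
) -> Tuple[float, float, List[float]]:
    """Hypothetical greedy strategy: keep a candidate only if it scores higher.

    `propose` and `evaluate` play the role of the execution infrastructure
    (assumed interface, not taken from the paper); the loop is the strategy.
    """
    best = initial
    best_score = evaluate(best)
    history = [best_score]  # best-so-far score after each step
    for _ in range(steps):
        candidate = propose(best)
        score = evaluate(candidate)
        if score > best_score:  # greedy acceptance rule
            best, best_score = candidate, score
        history.append(best_score)
    return best, best_score, history


# Toy stand-ins for the infrastructure (assumptions for illustration only).
rng = random.Random(0)


def toy_propose(x: float) -> float:
    # Perturb the current best solution by a small random amount.
    return x + rng.uniform(-1.0, 1.0)


def toy_evaluate(x: float) -> float:
    # Higher is better; the maximum is at x = 3.
    return -(x - 3.0) ** 2


best, best_score, history = greedy_hill_climb(0.0, toy_propose, toy_evaluate, steps=50)
print(f"best={best:.3f} score={best_score:.3f}")
```

Because the loop only sees `propose` and `evaluate`, a different strategy (for example, one that keeps several candidates alive) could be swapped in against the same two callables, which is the kind of controlled comparison the abstract argues for.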

Code & Models

Repositories

qrzou/FML-bench (GitHub)