FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
Qiran Zou, Hou Hei Lam, Wenhao Zhao, Tingting Chen, Yiming Tang, Samson Yu, Yingtao Zhu, Srinivas Anumasa, Zufeng Zhang, Tianyi Zhang, Chang Liu, Zhengyao Jiang, Anirudh Goyal, Dianbo Liu

TL;DR
FML-Bench is a new benchmark for evaluating AI research agent strategies across diverse ML tasks, emphasizing process-level metrics to understand exploration behaviors and strategy effectiveness.
Contribution
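The paper introduces FML-Bench, a benchmark of 18 ML research tasks across 10 domains with 12 process-level behavioral metrics. It separates agent strategy from execution infrastructure, so that performance differences can be attributed to strategy and agent behavior can be analyzed beyond final scores.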
Findings
Simple greedy hill-climbing performs nearly as well as complex tree-search agents.
Adaptive strategies that switch exploration modes outperform static agents.
Early convergence and focused exploration correlate with higher final performance.
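To make the first finding concrete, here is a minimal sketch of greedy hill-climbing written so that the search strategy is kept apart from the code that runs and scores a candidate. It is an illustration under stated assumptions, not the benchmark's implementation: the names `greedy_hill_climb`, `toy_propose`, and `toy_evaluate` are hypothetical, a solution is reduced to a single float, and the proposal step is a random perturbation standing in for an agent's edit.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical stand-ins: in a real agent a "solution" would be code or a
# config, `propose` would be an LLM edit, and `evaluate` would be the
# execution infrastructure that trains and scores the candidate.
Solution = float
Propose = Callable[[Solution, random.Random], Solution]
Evaluate = Callable[[Solution], float]


def greedy_hill_climb(
    initial: Solution,
    propose: Propose,
    evaluate: Evaluate,
    steps: int,
    seed: int = 0,
) -> Tuple[Solution, float, List[float]]:
    """Keep one incumbent; accept a candidate only if it scores strictly higher."""
    rng = random.Random(seed)
    best = initial
    best_score = evaluate(best)
    history = [best_score]  # best-so-far score after each step
    for _ in range(steps):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
        history.append(best_score)
    return best, best_score, history


def toy_evaluate(x: Solution) -> float:
    # Toy objective with its maximum at x = 3 (assumption for illustration).
    return -(x - 3.0) ** 2


def toy_propose(x: Solution, rng: random.Random) -> Solution:
    # Small random perturbation of the current incumbent.
    return x + rng.uniform(-0.5, 0.5)


best, score, history = greedy_hill_climb(0.0, toy_propose, toy_evaluate, steps=200)
```

Because `greedy_hill_climb` only sees `propose` and `evaluate` as callables, a tree-search or evolutionary strategy could be swapped in against the same two functions. The `history` list is the kind of per-step trace from which process-level measures, such as how early the search converges, could be computed.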
Abstract
AI research agents accelerate ML research by automating hypothesis generation, experimentation, and empirical refinement. Existing agent strategies range from greedy hill-climbing to tree search and evolutionary optimization, yet which strategy choices drive performance remains unclear. Answering this question requires a benchmark that separates agent strategy (e.g., search topology) from execution infrastructure (e.g., code editor), so that performance differences are attributable to strategy rather than infrastructure, and that provides process-level metrics beyond final scores to analyze exploration behaviors. Existing benchmarks offer limited support. We propose FML-Bench, a benchmark of 18 fundamental ML research tasks across 10 domains that separates agent strategy from execution infrastructure and defines 12 process-level behavioral metrics. Evaluating six representative agents,…
