MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research

Hui Chen; Miao Xiong; Yujie Lu; Wei Han; Ailin Deng; Yufei He; Jiaying Wu; Yibo Li; Yue Liu; Bryan Hooi

arXiv:2505.19955·cs.LG·October 23, 2025


TL;DR

MLR-Bench is a benchmark for evaluating AI agents on open-ended machine learning research. It combines diverse research tasks, automated evaluation, and a modular research agent to assess how well agents generate ideas, formulate proposals, run experiments, and write scientific papers.

Contribution

This work introduces MLR-Bench, a novel framework integrating research tasks, an LLM-based evaluation system, and modular research agents for systematic assessment of AI in scientific discovery.

Findings

01. LLMs effectively generate coherent ideas and structured papers.

02. Current coding agents often produce fabricated or invalid experimental results.

03. MLR-Judge shows high agreement with human experts in evaluation.

Abstract

Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages, and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier…
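The four-stage scaffold and the two evaluation modes described in the abstract can be illustrated with a minimal sketch. Only the four stage names come from the abstract; every function, class, and parameter name below (`run_pipeline`, `ResearchRun`, `toy_stage`, `toy_judge`, and so on) is a hypothetical placeholder, not the MLR-Agent or MLR-Judge code, and the toy stage and judge functions stand in for LLM calls and rubric-based review.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Stage names are taken from the abstract; all other names are hypothetical.
STAGES = [
    "idea generation",
    "proposal formulation",
    "experimentation",
    "paper writing",
]


@dataclass
class ResearchRun:
    """Holds the task description and the output of each stage."""
    task: str
    outputs: Dict[str, str] = field(default_factory=dict)


def run_pipeline(task: str, stage_fns: Dict[str, Callable[[str], str]]) -> ResearchRun:
    """Run the stages in order, feeding each stage's output into the next."""
    run = ResearchRun(task=task)
    previous = task
    for name in STAGES:
        previous = stage_fns[name](previous)
        run.outputs[name] = previous
    return run


def stepwise_scores(run: ResearchRun, judge: Callable[[str, str], float]) -> Dict[str, float]:
    """Stepwise assessment: score the output of every stage separately."""
    return {name: judge(name, run.outputs[name]) for name in STAGES}


def end_to_end_score(run: ResearchRun, judge: Callable[[str, str], float]) -> float:
    """End-to-end evaluation: score only the final paper."""
    final_stage = STAGES[-1]
    return judge(final_stage, run.outputs[final_stage])


def toy_stage(name: str) -> Callable[[str], str]:
    """Placeholder stage that tags its input; a real stage would call a model."""
    return lambda text: f"[{name}] {text}"


def toy_judge(stage: str, text: str) -> float:
    """Placeholder judge: 1.0 for any non-empty output, 0.0 otherwise."""
    return 1.0 if text else 0.0
```

A run would then look like `run = run_pipeline("some workshop task", {s: toy_stage(s) for s in STAGES})`, followed by `stepwise_scores(run, toy_judge)` or `end_to_end_score(run, toy_judge)`. Keeping the stages as separately callable functions is what makes both assessment modes possible from a single run.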


Code & Models

Repositories

chchenhui/mlrbench
pytorch · Official

Datasets

chchenhui/mlrbench-tasks
dataset · 16 dl


Taxonomy

Topics: Machine Learning and Data Classification