MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

Bohan Lyu; Yucheng Yang; Siqiao Huang; Jiaru Zhang; Qixin Xu; Xinghan Li; Xinyang Han; Yicheng Zhang; Huaqing Zhang; Runhan Huang; Kaicheng Yang; Zitao Chen; Wentao Guo; Junlin Yang; Xinyue Ai; Wenhao Chai; Yadi Cao; Ziran Yang; Kun Wang; Dapeng Jiang; Huan-ang Gao; Shange Tang; Chengshuai Shi; Simon S. Du; Max Simchowitz; Jiantao Jiao; Dawn Song; Chi Jin

arXiv:2605.08678 · cs.LG · May 12, 2026

2 Repos · 1 Dataset

TL;DR

MLS-Bench is a comprehensive benchmark designed to evaluate AI systems' ability to invent generalizable and scalable machine learning methods across diverse tasks, highlighting current limitations and guiding future research.

Contribution

The paper introduces MLS-Bench, a new benchmark of 140 tasks across 12 domains that assesses whether AI systems can invent ML methods that generalize and scale, and it reports where current agents fall short.

Findings

01. Current agents struggle to reliably surpass human-designed methods.

02. Engineering-style tuning is easier for agents than genuine method invention.

03. Increasing search, compute, or context alone does not overcome the bottleneck.

Abstract

Modern AI progress has been driven by ML methods that are generalizable across settings and scalable to larger regimes. As large language models demonstrate advanced capabilities in reasoning, coding, and engineering tasks, it is increasingly important to understand whether they can discover such methods rather than only apply existing ones. We introduce MLS-Bench, a benchmark for evaluating whether AI systems can invent generalizable and scalable ML methods. MLS-Bench contains 140 tasks across 12 domains, each requiring an agent to improve one targeted component of an ML system or algorithm and demonstrate that the improvement generalizes across controlled settings and scales. We find that current agents remain far from reliably surpassing human-designed methods, and that engineering-style tuning is easier for them than genuine method invention. We further study the effects of…
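The abstract describes tasks in which an agent must show that its improvement holds across controlled settings and scales. The sketch below is one way such a check could be written. It is not the benchmark's actual scoring code: the `Result` record, the `generalizes` and `win_rate` helpers, the setting and scale labels, and all numbers are hypothetical and exist only to illustrate the idea of requiring a win in every cell rather than on average.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One evaluation cell (hypothetical schema, not from the paper)."""
    setting: str      # illustrative label for a controlled setting
    scale: str        # illustrative label for a scale regime
    baseline: float   # score of the human-designed method (higher is better)
    candidate: float  # score of the agent's proposed method


def generalizes(results, margin=0.0):
    """Return True only if the candidate beats the baseline by more than
    `margin` in every cell; an empty result list never counts as a pass."""
    return len(results) > 0 and all(
        r.candidate - r.baseline > margin for r in results
    )


def win_rate(results):
    """Fraction of cells in which the candidate strictly beats the baseline."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.candidate > r.baseline for r in results) / len(results)


# Made-up numbers for illustration only; they are not results from the paper.
results = [
    Result("setting_a", "small", baseline=0.70, candidate=0.74),
    Result("setting_a", "large", baseline=0.80, candidate=0.83),
    Result("setting_b", "small", baseline=0.65, candidate=0.64),
    Result("setting_b", "large", baseline=0.78, candidate=0.81),
]

print(generalizes(results))  # False: the candidate loses in one cell
print(win_rate(results))     # 0.75
```

Requiring a win in every cell is a deliberately strict choice for this sketch: a method that wins on average but regresses in one setting would not count as generalizing under this reading.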

Code & Models

Repositories

Datasets

Bohan22/MLS-Bench-Tasks
dataset · 1.8k downloads
