InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research

Yunze Wu; Dayuan Fu; Weiye Si; Zhen Huang; Mohan Jiang; Keyu Li; Shijie Xia; Jie Sun; Tianze Xu; Xiangkun Hu; Pengrui Lu; Xiaojie Cai; Lyumanshan Ye; Wenhong Zhu; Yang Xiao; Pengfei Liu

arXiv:2510.27598 · cs.AI · November 4, 2025

TL;DR

InnovatorBench is a comprehensive benchmark platform designed to evaluate AI agents' capabilities in conducting end-to-end large language model research tasks, highlighting current limitations and potential areas for improvement.

Contribution

The paper introduces InnovatorBench and ResearchGym, enabling realistic assessment of LLM research agents and revealing the challenges frontier models face in complex, long-horizon tasks.

Findings

1. Frontier models show promise on code-driven research tasks.

2. Models struggle with fragile algorithm-related tasks and with long-horizon decision making.

3. Agents need more than 11 hours to reach their best performance on InnovatorBench.

Abstract

AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis, yet existing benchmarks probe narrow skills in simplified settings. To address this gap, we introduce InnovatorBench, a benchmark-platform pair for realistic, end-to-end assessment of agents performing Large Language Model (LLM) research. It comprises 20 tasks spanning Data Construction, Filtering, Augmentation, Loss Design, Reward Design, and Scaffold Construction, which require runnable artifacts and assessment of correctness, performance, output quality, and uncertainty. To support agent operation, we develop ResearchGym, a research environment offering rich action spaces, distributed and long-horizon execution, asynchronous monitoring, and snapshot saving. We also implement a lightweight ReAct agent that couples explicit reasoning with executable…
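The ReAct pattern mentioned in the abstract interleaves a reasoning step with an executable action and feeds the resulting observation back into the next step. The following is a minimal sketch of that loop, not the paper's implementation: `run_react`, `Step`, `scripted_policy`, and the `run` tool are hypothetical names introduced for illustration, and the scripted policy stands in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """One reason-act-observe cycle (hypothetical structure)."""
    thought: str
    action: str
    argument: str
    observation: str = ""


# A policy maps the history so far to (thought, action, argument).
Policy = Callable[[List[Step]], Tuple[str, str, str]]


def run_react(policy: Policy,
              tools: Dict[str, Callable[[str], str]],
              max_steps: int = 5) -> List[Step]:
    """Alternate reasoning and acting until 'finish' or the step budget ends."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action, argument = policy(history)
        if action == "finish":
            history.append(Step(thought, action, argument, "done"))
            break
        tool = tools.get(action)
        # Unknown actions become observations so the policy can recover.
        observation = tool(argument) if tool else f"unknown action: {action}"
        history.append(Step(thought, action, argument, observation))
    return history


def scripted_policy(history: List[Step]) -> Tuple[str, str, str]:
    """Stand-in for an LLM: run one command, then stop (assumption)."""
    if not history:
        return ("I should run the training script.", "run", "train.py")
    return ("The run produced output; stop here.", "finish", "")


# Hypothetical tool that only echoes its argument instead of executing it.
tools = {"run": lambda arg: f"executed {arg}"}
trace = run_react(scripted_policy, tools)
```

Each entry in `trace` records the thought, the chosen action, and the observation returned by the tool. A real research agent would replace `scripted_policy` with a model call and `tools` with environment actions such as file edits or job submission.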

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Materials Science · Topic Modeling