Predicting Empirical AI Research Outcomes with Language Models

Jiaxin Wen; Chenglei Si; Yueh-han Chen; He He; Shi Feng

arXiv:2506.00794 · cs.AI · June 3, 2025

Open Access

TL;DR

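This paper introduces a benchmark for predicting which of two AI research ideas will perform better, along with a system that combines a fine-tuned GPT-4.1 with a paper retrieval agent. The system outperforms human experts and other language models, suggesting that such predictions could help accelerate empirical AI research.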

Contribution

The paper presents the first benchmark for predicting which of two AI research ideas will perform better empirically, and develops a system that surpasses human experts on this task in the NLP domain.

Findings

01. System achieves 77% accuracy on known (already published) ideas

02. In the NLP domain, system outperforms human experts (64.4% vs. 48.9%) at predicting which idea will succeed

03. System remains robust across a range of robustness tests

Abstract

Many promising-looking ideas in AI research fail to deliver, but their validation takes substantial human labor and compute. Predicting an idea's chance of success is thus crucial for accelerating empirical AI research, a skill that even expert researchers can only acquire through substantial experience. We build the first benchmark for this task and compare LMs with human experts. Concretely, given two research ideas (e.g., two jailbreaking methods), we aim to predict which will perform better on a set of benchmarks. We scrape ideas and experimental results from conference papers, yielding 1,585 human-verified idea pairs published after our base model's cut-off date for testing, and 6,000 pairs for training. We then develop a system that combines a fine-tuned GPT-4.1 with a paper retrieval agent, and we recruit 25 human experts to compare with. In the NLP domain, our system beats human…
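The pairwise setup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the `IdeaPair` fields, the `split_by_cutoff` and `accuracy` helpers, the toy data, and the cut-off date are all hypothetical. The abstract only states that test pairs were published after the base model's cut-off date, so sending earlier pairs to training is an assumption made here for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class IdeaPair:
    """Two competing research ideas and the one that performed better (hypothetical schema)."""
    idea_a: str
    idea_b: str
    published: date  # publication date of the source paper
    winner: str      # "a" or "b": the idea that scored better on the benchmarks


def split_by_cutoff(pairs: List[IdeaPair], cutoff: date) -> Tuple[List[IdeaPair], List[IdeaPair]]:
    """Assumed split: pairs published after the cutoff are held out for testing."""
    train = [p for p in pairs if p.published <= cutoff]
    test = [p for p in pairs if p.published > cutoff]
    return train, test


def accuracy(predict: Callable[[str, str], str], pairs: List[IdeaPair]) -> float:
    """Fraction of pairs for which the predictor names the true winner."""
    if not pairs:
        raise ValueError("cannot compute accuracy on an empty set of pairs")
    correct = sum(predict(p.idea_a, p.idea_b) == p.winner for p in pairs)
    return correct / len(pairs)


def always_a(idea_a: str, idea_b: str) -> str:
    """Trivial baseline that always picks the first idea."""
    return "a"


# Toy data, invented for illustration only.
pairs = [
    IdeaPair("method 1", "method 2", date(2024, 1, 10), "a"),
    IdeaPair("method 3", "method 4", date(2024, 9, 5), "b"),
    IdeaPair("method 5", "method 6", date(2024, 11, 20), "a"),
]

train, test = split_by_cutoff(pairs, cutoff=date(2024, 6, 1))
baseline_accuracy = accuracy(always_a, test)
print(len(train), len(test), baseline_accuracy)
```

Any predictor with the same `(idea_a, idea_b) -> "a" | "b"` signature, whether a language model wrapper or a human annotator's recorded choices, could be passed to `accuracy` in place of `always_a`. When the two possible winners are balanced, 50% is the chance level against which the reported accuracies can be read.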

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling

Methods: GPT-4 · Sparse Evolutionary Training · Balanced Selection