Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation
Srujan P Mule, Aniketh Garikaparthi, Manasi Patwardhan

TL;DR
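This paper trains language models to predict, before any experiments are run, which of two candidate research ideas will achieve better benchmark performance. It uses a dataset of 11,488 idea pairs grounded in PapersWithCode outcomes, together with supervised fine-tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR), so that large pools of AI-generated ideas can be filtered without exhaustive experimentation.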
Contribution
It introduces a dataset of 11,488 idea pairs for comparative empirical forecasting of research ideas and demonstrates that small (8B-parameter) language models can be trained to predict research success effectively and with interpretability.
Findings
Supervised fine-tuning (SFT) raises the accuracy of 8B-parameter models from 30% to 77.1%, ahead of GPT-5 (61.1%).
Reinforcement Learning with Verifiable Rewards achieves 71.35% accuracy.
Models transfer well across domains and time splits.
Abstract
As language models accelerate scientific research by automating hypothesis generation and implementation, a new bottleneck emerges: evaluating and filtering hundreds of AI-generated ideas without exhaustive experimentation. We ask whether LMs can learn to forecast the empirical success of research ideas before any experiments are run. We study comparative empirical forecasting: given a benchmark-specific research goal and two candidate ideas, predict which will achieve better benchmark performance. We construct a dataset of 11,488 idea pairs grounded in objective outcomes from PapersWithCode. While off-the-shelf 8B-parameter models struggle (30% acc.), SFT dramatically boosts performance to 77.1%, outperforming GPT-5 (61.1%). By framing evaluation as a reasoning task via Reinforcement Learning with Verifiable Rewards (RLVR), we train models to discover latent reasoning paths, achieving…
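The comparative task in the abstract can be illustrated with a small sketch: a pair of ideas is labeled by whichever achieved the better benchmark score, and a model's prediction is rewarded only if it matches that label. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `IdeaPair`, `parse_choice`, and `verifiable_reward`, the `Answer: A` / `Answer: B` output format, and the tie handling are all hypothetical; the paper's actual prompt format and reward design are not given in this summary.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IdeaPair:
    """One comparison: a research goal, two ideas, and their benchmark scores.

    Hypothetical schema for illustration; the dataset's real fields may differ.
    """
    goal: str
    idea_a: str
    idea_b: str
    score_a: float
    score_b: float
    higher_is_better: bool = True  # False for metrics such as error rate

    @property
    def label(self) -> str:
        # Assumes no ties; a tie would fall through to "B" here.
        if self.higher_is_better:
            return "A" if self.score_a > self.score_b else "B"
        return "A" if self.score_a < self.score_b else "B"


# Assumed output convention: the model ends its reasoning with "Answer: A" or "Answer: B".
_ANSWER = re.compile(r"answer\s*:\s*([AB])\b", re.IGNORECASE)


def parse_choice(completion: str) -> Optional[str]:
    """Return the last stated choice ("A" or "B"), or None if none is found."""
    matches = _ANSWER.findall(completion)
    return matches[-1].upper() if matches else None


def verifiable_reward(completion: str, pair: IdeaPair) -> float:
    """Binary reward: 1.0 if the parsed choice matches the outcome label, else 0.0."""
    return 1.0 if parse_choice(completion) == pair.label else 0.0


def accuracy(completions: list[str], pairs: list[IdeaPair]) -> float:
    """Fraction of pairs predicted correctly; unparseable outputs count as wrong."""
    rewards = [verifiable_reward(c, p) for c, p in zip(completions, pairs)]
    return sum(rewards) / len(rewards)
```

Because the label comes from recorded benchmark results rather than from a human judgment, the same function can serve both as an evaluation metric and as a reward signal during reinforcement learning. Note that in this sketch an output with no parseable answer scores zero, which is one way a model could fall below the 50% expected from random guessing on a two-way choice.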
