Evaluation-driven Scaling for Scientific Discovery

Haotian Ye; Haowei Lin; Jingyi Tang; Yizhen Luo; Caiyin Yang; Chang Su; Rahul Thapa; Rui Yang; Ruihua Liu; Zeyu Li; Chong Gao; Dachao Ding; Guangrong He; Miaolei Zhang; Lina Sun; Wenyang Wang; Yuchen Zhong; Zhuohao Shen; Di He; Jianzhu Ma; Stefano Ermon; Tongyang Li; Xiaowen Chu; James Zou; Yuzhi Xu

arXiv:2604.19341·cs.LG·April 22, 2026

TL;DR

This paper introduces SimpleTES, a framework for scaling evaluation-driven scientific discovery loops, leading to state-of-the-art solutions across diverse domains and improving model efficiency and generalization.

Contribution

The paper presents SimpleTES, a scalable framework that combines exploration, feedback, and selection to enhance scientific discovery with language models.

Findings

01

Discovered state-of-the-art solutions in 21 scientific problems across six domains.

02

Sped up the LASSO algorithm by more than 2x.

03

Reduced quantum circuit gate overhead by 24.5%.

Abstract

Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions. While prior work has highlighted the importance of evaluation, it has not explicitly formulated the problem of how evaluation-driven discovery loops can be scaled up in a principled and effective manner to push the boundaries of scientific discovery. This paper seeks to address that problem. We introduce Simple Test-time Evaluation-driven Scaling (SimpleTES), a general framework that strategically combines parallel exploration, feedback-driven refinement, and local selection, revealing substantial gains unlocked by scaling evaluation-driven discovery…
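
The three ingredients named in the abstract (parallel exploration, feedback-driven refinement, local selection) can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the SimpleTES implementation: the scoring function `evaluate`, the proposer `propose` (a random perturbation standing in for a language model), the function `discovery_loop`, and all parameter values are hypothetical and do not come from the paper.

```python
import math
import random


def evaluate(x: float) -> float:
    """Hypothetical scoring function standing in for a verifier or simulator.

    Higher is better; the maximum (0.0) is reached at x = 3.
    """
    return -((x - 3.0) ** 2)


def propose(parent: float, parent_score: float, rng: random.Random) -> float:
    """Hypothetical stand-in for a language model refining `parent`.

    The evaluation feedback (`parent_score`) sets the step size:
    poorly scoring parents are perturbed more aggressively.
    """
    sigma = min(2.0, 0.1 + 0.5 * math.sqrt(abs(parent_score)))
    return parent + rng.gauss(0.0, sigma)


def discovery_loop(num_chains: int = 4, steps: int = 50, seed: int = 0):
    rng = random.Random(seed)
    # Parallel exploration: several independent chains, each with its own start.
    chains = [rng.uniform(-10.0, 10.0) for _ in range(num_chains)]
    scores = [evaluate(x) for x in chains]
    initial_best = max(scores)

    for _ in range(steps):
        for i in range(num_chains):
            # Feedback-driven refinement: propose from the chain's current best.
            candidate = propose(chains[i], scores[i], rng)
            score = evaluate(candidate)
            # Local selection: the candidate competes only within its own chain.
            if score > scores[i]:
                chains[i], scores[i] = candidate, score

    best = max(range(num_chains), key=lambda i: scores[i])
    return chains[best], scores[best], initial_best


if __name__ == "__main__":
    best_x, best_score, initial_best = discovery_loop()
    print(f"best candidate: {best_x:.3f}, score: {best_score:.5f}")
```

In this sketch, scaling evaluation corresponds to raising `num_chains` or `steps`, since each proposal costs exactly one call to `evaluate`. Selection is kept per chain so that one strong chain does not overwrite the others early on; whether SimpleTES makes the same choice is not stated in the text above.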
