Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments

Bingyang Ye; Shan Chen; Jingxuan Tu; Chen Liu; Zidi Xiong; Samuel Schmidgall; Danielle S. Bitterman

arXiv:2601.07606·cs.CL·January 13, 2026

TL;DR

PoT is a benchmarking framework that evaluates scientific idea judgments by linking them to future observable signals, enabling scalable, verifiable assessment of models' forecasting abilities in scientific research.

Contribution

Introduces PoT, a semi-verifiable benchmark linking scientific idea judgments to future signals, facilitating scalable evaluation of models and agents in scientific forecasting tasks.

Findings

01 Higher interaction budgets improve agent performance.

02 Tool use benefits are task-dependent.

03 PoT enables scalable, future-verifiable evaluation.

Abstract

Large language models are increasingly being used to assess and forecast research ideas, yet we lack scalable ways to evaluate the quality of models' judgments about these scientific ideas. Towards this goal, we introduce PoT, a semi-verifiable benchmarking framework that links scientific idea judgments to downstream signals that become observable later (e.g., citations and shifts in researchers' agendas). PoT freezes a pre-cutoff snapshot of evidence in an offline sandbox and asks models to forecast post-cutoff outcomes, enabling verifiable evaluation when ground truth arrives, scalable benchmarking without exhaustive expert annotation, and analysis of human-model misalignment against signals such as peer-review awards. In addition, PoT provides a controlled testbed for agent-based research judgments that evaluate scientific ideas, comparing tool-using agents to non-agent baselines…
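The freeze-then-forecast protocol described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the PoT codebase: the `Evidence` record, the `freeze_snapshot`, `persistence_forecast`, and `pairwise_agreement` helpers, the toy citation counts, and the pairwise-ordering metric are all hypothetical. The sketch shows only the general idea of hiding post-cutoff evidence from the forecaster and scoring its predictions once the later signal is available.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Evidence:
    """Hypothetical record: one citation count observed for one paper."""
    paper_id: str
    observed_on: date
    citations: int  # cumulative citations as of observed_on


def freeze_snapshot(records, cutoff):
    """Keep only evidence observed on or before the cutoff date."""
    return [r for r in records if r.observed_on <= cutoff]


def persistence_forecast(snapshot):
    """Toy baseline (an assumption): predict that the latest pre-cutoff
    citation count carries over as the paper's future score."""
    scores = {}
    for r in sorted(snapshot, key=lambda r: r.observed_on):
        scores[r.paper_id] = float(r.citations)
    return scores


def pairwise_agreement(predicted, outcomes):
    """Fraction of paper pairs whose predicted order matches the outcome.
    Pairs tied in the outcome are skipped. Illustrative metric only."""
    pairs = [
        (a, b)
        for a, b in combinations(sorted(outcomes), 2)
        if outcomes[a] != outcomes[b]
    ]
    if not pairs:
        return 0.0
    agree = sum(
        (predicted[a] > predicted[b]) == (outcomes[a] > outcomes[b])
        for a, b in pairs
    )
    return agree / len(pairs)


# Made-up toy data: three papers observed before and after the cutoff.
records = [
    Evidence("A", date(2024, 6, 1), 10),
    Evidence("B", date(2024, 6, 1), 3),
    Evidence("C", date(2024, 6, 1), 5),
    Evidence("A", date(2025, 6, 1), 12),
    Evidence("B", date(2025, 6, 1), 40),
    Evidence("C", date(2025, 6, 1), 9),
]
cutoff = date(2024, 12, 31)

# The forecaster sees only the frozen pre-cutoff snapshot.
snapshot = freeze_snapshot(records, cutoff)
forecast = persistence_forecast(snapshot)

# Ground truth becomes available only after the cutoff.
outcomes = {r.paper_id: r.citations for r in records if r.observed_on > cutoff}
score = pairwise_agreement(forecast, outcomes)
print(f"pairwise agreement: {score:.3f}")
```

In this sketch the forecaster never receives post-cutoff records, so any score it earns reflects judgment formed from the frozen evidence alone. A model or agent under evaluation would take the place of `persistence_forecast`.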

Code & Models

Datasets

AIM-Harvard/proof-of-time
dataset · 18 downloads

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Computational and Text Analysis Methods · Machine Learning in Materials Science