HindSight: Evaluating LLM-Generated Research Ideas via Future Impact

Bo Jiang

arXiv:2603.15164·cs.CL·March 18, 2026

TL;DR

HindSight introduces a novel evaluation framework for AI-generated research ideas that measures future impact by comparing ideas against actual subsequent publications and citations, revealing discrepancies with traditional LLM judgments.

Contribution

This paper presents HindSight, a time-split evaluation method that objectively assesses research idea quality based on real future outcomes, addressing limitations of subjective LLM or human evaluations.

Findings

01

Under HindSight, the retrieval-augmented system produces ideas that score 2.5× higher than vanilla generation (p < 0.001).

02

LLM-as-Judge finds no significant difference between retrieval-augmented and vanilla idea generation (p = 0.584).

03

LLM-judged novelty negatively correlates with actual research impact.

Abstract

Evaluating AI-generated research ideas typically relies on LLM judges or human panels, both subjective and disconnected from actual research impact. We introduce HindSight, a time-split evaluation framework that measures idea quality by matching generated ideas against real future publications and scoring them by citation impact and venue acceptance. Using a temporal cutoff $T$, we restrict an idea generation system to pre-$T$ literature, then evaluate its outputs against papers published in the subsequent 30 months. Experiments across 10 AI/ML research topics reveal a striking disconnect: LLM-as-Judge finds no significant difference between retrieval-augmented and vanilla idea generation ($p = 0.584$), while HindSight shows the retrieval-augmented system produces 2.5$\times$ higher-scoring ideas ($p < 0.001$). Moreover, HindSight scores are negatively correlated with…
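
To make the time-split protocol concrete, here is a minimal Python sketch of the idea as the abstract describes it: keep only papers published within 30 months after the cutoff $T$, match a generated idea to its closest future paper, and score the match by citations and venue acceptance. The abstract does not specify the matching method or the scoring formula, so everything below is an assumption for illustration: the `Paper` fields, the token-overlap (Jaccard) similarity, the `threshold` and `venue_bonus` parameters, and the `log1p(citations)` weighting are all hypothetical stand-ins, not the paper's actual implementation.

```python
import calendar
import math
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Paper:
    text: str        # title plus abstract (hypothetical field)
    published: date  # publication date
    citations: int   # citation count at evaluation time
    accepted: bool   # accepted at a peer-reviewed venue


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months, clamping the day if needed."""
    idx = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(idx, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard token overlap; an assumed stand-in for the real matcher."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def hindsight_score(idea: str, papers: list, cutoff: date,
                    months: int = 30, threshold: float = 0.3,
                    venue_bonus: float = 1.0) -> float:
    """Score an idea by its best-matching paper published after the cutoff."""
    end = add_months(cutoff, months)
    # Only papers inside the evaluation window (cutoff, cutoff + months] count.
    window = [p for p in papers if cutoff < p.published <= end]
    if not window:
        return 0.0
    best = max(window, key=lambda p: similarity(idea, p.text))
    if similarity(idea, best.text) < threshold:
        return 0.0  # no future paper resembles the idea closely enough
    # Assumed impact formula: damped citations plus a bonus for acceptance.
    return math.log1p(best.citations) + (venue_bonus if best.accepted else 0.0)


if __name__ == "__main__":
    cutoff = date(2023, 1, 1)
    corpus = [
        Paper("Retrieval augmented idea generation for research",
              date(2024, 3, 1), citations=9, accepted=True),
        Paper("Graph neural networks for molecules",
              date(2022, 6, 1), citations=500, accepted=True),  # pre-cutoff
    ]
    print(hindsight_score("retrieval augmented idea generation", corpus, cutoff))
```

The pre-cutoff paper is excluded from scoring regardless of its citation count, which is the point of the time split: an idea earns credit only for anticipating work the generator could not have seen.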

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Scientific Computing and Data Management · Expert finding and Q&A systems