HindSight: Evaluating LLM-Generated Research Ideas via Future Impact
Bo Jiang

TL;DR
HindSight introduces a novel evaluation framework for AI-generated research ideas that measures future impact by comparing ideas against actual subsequent publications and citations, revealing discrepancies with traditional LLM judgments.
Contribution
This paper presents HindSight, a time-split evaluation method that objectively assesses research idea quality based on real future outcomes, addressing limitations of subjective LLM or human evaluations.
Findings
HindSight shows retrieval-augmented ideas have 2.5× higher impact scores.
LLM judges do not distinguish between retrieval-augmented and vanilla ideas.
LLM-judged novelty negatively correlates with actual research impact.
Abstract
Evaluating AI-generated research ideas typically relies on LLM judges or human panels, both of which are subjective and disconnected from actual research impact. We introduce HindSight, a time-split evaluation framework that measures idea quality by matching generated ideas against real future publications and scoring them by citation impact and venue acceptance. Using a temporal cutoff, we restrict an idea generation system to pre-cutoff literature, then evaluate its outputs against papers published in the subsequent 30 months. Experiments across 10 AI/ML research topics reveal a striking disconnect: LLM-as-Judge finds no significant difference between retrieval-augmented and vanilla idea generation, while HindSight shows the retrieval-augmented system produces 2.5× higher-scoring ideas. Moreover, HindSight scores are negatively correlated with…
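To make the time-split idea concrete, here is a minimal Python sketch of the general pattern the abstract describes: split a corpus at a cutoff, match each generated idea to a paper in the 30-month future window, and score the match by citations and venue acceptance. Everything specific in it is an assumption for illustration and is not taken from the paper: the `Paper` fields, the token-overlap (Jaccard) matcher, the 0.3 match threshold, the `log1p(citations)` plus venue-bonus scoring rule, and the placeholder cutoff date. The paper's actual matching procedure and scoring formula are not given in this excerpt.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass
class Paper:
    # Hypothetical record layout; the paper's real schema is not specified here.
    title: str
    published: date
    citations: int
    accepted_at_venue: bool


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a simple stand-in for a real idea-to-paper matcher."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to 28 so the result is always valid."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, min(d.day, 28))


def split_corpus(papers, cutoff: date, horizon_months: int = 30):
    """Return (pre-cutoff papers, papers inside the future evaluation window)."""
    end = add_months(cutoff, horizon_months)
    past = [p for p in papers if p.published < cutoff]
    future = [p for p in papers if cutoff <= p.published < end]
    return past, future


def impact_score(idea: str, future, threshold: float = 0.3, venue_bonus: float = 1.0) -> float:
    """Score an idea by its best-matching future paper (assumed scoring rule)."""
    best = max(future, key=lambda p: jaccard(idea, p.title), default=None)
    if best is None or jaccard(idea, best.title) < threshold:
        return 0.0  # no sufficiently similar future paper: the idea earns no credit
    return math.log1p(best.citations) + (venue_bonus if best.accepted_at_venue else 0.0)


# Toy data, invented purely for illustration.
papers = [
    Paper("retrieval augmented generation for question answering", date(2022, 5, 1), 100, True),
    Paper("sparse attention for long context language models", date(2023, 6, 1), 50, True),
    Paper("diffusion models for protein design", date(2026, 1, 1), 10, False),
]
cutoff = date(2023, 1, 1)  # placeholder cutoff, not the one used in the paper
past, future = split_corpus(papers, cutoff)

matched = impact_score("sparse attention for long context models", future)
unmatched = impact_score("graph neural networks for chemistry", future)
```

In this sketch the idea generator would see only `past`, while scoring uses only `future`, so an idea earns credit only if something similar was actually published after the cutoff. The third toy paper falls outside the 30-month window and is ignored by both sides.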
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Scientific Computing and Data Management · Expert finding and Q&A systems
