Scientific judgment drifts over time in AI ideation
Lingyu Zhang, Mitchell Wang, Boyuan Chen

TL;DR
This study shows that human evaluations of scientific ideas vary over time, which challenges the assumption of fixed standards and points to the need for dynamic assessment methods in AI-assisted research.
Contribution
It demonstrates that expert judgments are not stable and that AI systems tuned to initial human ratings may not maintain alignment over time.
Findings
Expert evaluations show moderate test-retest reliability.
Internal judgment criteria remain stable over time.
Alignment to initial ratings does not persist once standards drift.
Abstract
Scientific discovery begins with ideas, yet evaluating early-stage research concepts is a subtle and subjective matter of human judgment. As large language models (LLMs) are increasingly tasked with generating scientific hypotheses, most systems implicitly assume that scientists' evaluations form a fixed gold standard that does not change over time. Here we challenge this assumption. In a two-wave study with 7,938 ratings from 63 active researchers across six scientific departments, each participant repeatedly evaluated a constant "control" research idea alongside AI-generated ideas. We find that expert evaluations are not stable: test-retest reliability of overall quality is only moderate (ICC~0.59-0.74), indicating substantial within-participant variability even for identical ideas. Yet the internal structure of judgment remained stable, such as the relative importance…
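To make the reliability figure concrete, here is a minimal sketch of how a test-retest intraclass correlation could be computed from two waves of ratings. The excerpt does not say which ICC variant the authors used; this sketch assumes the two-way random-effects, absolute-agreement, single-measurement form, often written ICC(2,1). The function name `icc_2_1` and the ratings in `demo_ratings` are hypothetical and are not data from the study.

```python
import numpy as np


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    `ratings` is an (n_subjects, k_occasions) array. In a test-retest setting,
    each row is one participant and each column is one wave.
    Assumption: this ICC variant is illustrative; the paper may use another.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()

    # Sums of squares for rows (participants), columns (waves), and residual.
    ss_rows = k * np.sum((x.mean(axis=1) - grand_mean) ** 2)
    ss_cols = n * np.sum((x.mean(axis=0) - grand_mean) ** 2)
    ss_total = np.sum((x - grand_mean) ** 2)
    ss_error = ss_total - ss_rows - ss_cols

    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical ratings of the same control idea by five participants
# in wave 1 (first column) and wave 2 (second column).
demo_ratings = [[7, 6], [5, 5], [8, 6], [4, 5], [6, 7]]
demo_icc = icc_2_1(demo_ratings)
print(f"ICC(2,1) for the hypothetical data: {demo_icc:.2f}")
```

A value of 1 would mean each participant gave the same rating in both waves. Values in the range the abstract reports indicate that a meaningful share of the variance comes from the same person rating the same idea differently across waves.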
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
