Stop Automating Peer Review Without Rigorous Evaluation
Joachim Baumann, Jiaxin Pei, Sanmi Koyejo, Dirk Hovy

TL;DR
This position paper argues against using today's large language models to write peer reviews, citing reduced review diversity and easily gamed scores, and calls for rigorous evaluation before any automation is adopted.
Contribution
The paper empirically compares human- and AI-generated ICLR 2026 reviews and evaluates how automated paper rewriting affects different AI reviewers, showing the risks of deploying LLMs in peer review without thorough validation.
Findings
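AI reviewers show excessive agreement within and across papers (a hivemind effect), which reduces the diversity of perspectives in reviews.
AI review scores can be raised by prompting an LLM to rewrite the paper, which changes style rather than scientific results.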
Automating peer review requires a scientific approach, not just deploying LLMs.
Abstract
Large language models offer a tempting solution to address the peer review crisis. This position paper argues that today's AI systems should not be used to produce paper reviews. We ground this position in an empirical comparison of human- versus AI-generated ICLR 2026 reviews and an evaluation of the effect of automated paper rewriting on different AI reviewers. We identify two critical issues: 1) AI reviewers exhibit a hivemind effect of excessive agreement within and across papers that reduces perspective diversity. 2) AI review scores are trivially gameable through paper laundering: prompting an LLM to rewrite a paper could significantly increase the scores from AI reviewers, demonstrating that LLM reviewers are easy to game through stylistic changes rather than scientific results. However, non-gameability and review diversity are necessary but not sufficient conditions for…
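To make the two issues concrete, here is a minimal sketch of how one might quantify them. It is not the authors' method: the function names, the choice of standard deviation as a spread measure, and all scores are assumptions made up for illustration. Lower within-paper spread is read as a rough sign of excessive agreement, and a positive mean score change after rewriting as a rough sign of gameability.

```python
from statistics import mean, pstdev


def within_paper_spread(scores_by_paper):
    """Average, over papers, of the standard deviation of reviewer scores.

    A low value means reviewers of the same paper give near-identical scores.
    (Hypothetical helper; the paper may measure agreement differently.)
    """
    return mean(pstdev(scores) for scores in scores_by_paper.values())


def mean_score_gain(before, after):
    """Average per-paper change in score after the paper is rewritten.

    (Hypothetical helper; assumes both dicts share the same paper ids.)
    """
    return mean(after[paper] - before[paper] for paper in before)


# Toy scores invented for this example; they are not results from the paper.
human = {"paper_a": [3, 6, 8], "paper_b": [2, 5, 6]}
ai = {"paper_a": [6, 6, 7], "paper_b": [6, 6, 6]}

original = {"paper_a": 5.0, "paper_b": 6.0}
rewritten = {"paper_a": 6.5, "paper_b": 7.0}

print("human spread:", within_paper_spread(human))
print("AI spread:", within_paper_spread(ai))
print("mean score gain after rewriting:", mean_score_gain(original, rewritten))
```

In this toy data the AI scores cluster more tightly than the human scores and the rewritten papers score higher on average. A real evaluation would also need agreement across papers, significance testing, and a check that the rewriting left the scientific content unchanged.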
