On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

Seungone Kim; Dongkeun Yoon; Kiril Gashteovski; Juyoung Suk; Jinheon Baek; Pranjal Aggarwal; Ian Wu; Viktor Zaverkin; Spase Petkoski; Daniel R. Schrider; Ilija Dukovski; Francesco Santini; Biljana Mitreska; Yong Jeong; Kyeongha Kwon; Young Min Sim; Dragana Manasova; Arthur Porto; Biljana Mojsoska; Makoto Takamoto; Marko Shuntov; Ruoqi Liu; Hyunjoo Jenny Lee; Niyazi Ulas Dinç; Yehhyun Jo; Sunkyu Han; Chungwoo Lee; Huishan Li; Esther H. R. Tsai; Ergun Simsek; Khushboo Shafi; Yeonseung Chung; Jihye Park; Aleksandar Shulevski; Henrik Christiansen; Yoosang Son; Elly Knight; Amanda Montoya; Jeongyoun Ahn; Christian Langkammer; Heera Moon; Changwon Yoon; Nikola Stikov; Mooseok Jang; Edward Choi; Junhan Kim; Yeon Sik Jung; Woo Youn Kim; Jae Kyoung Kim; Ishraq Md Anjum; Hyun Uk Kim; Drew Bridges; Carolin Lawrence; Xiang Yue; Alice Oh; Akari Asai; Sean Welleck; Graham Neubig

arXiv:2605.20668 · cs.CL · May 21, 2026

TL;DR

This study uses expert annotations to evaluate what AI reviewers can and cannot do in scientific peer review. It finds that they outperform some human reviewers in certain respects but also show significant weaknesses, which positions them as complementary tools.

Contribution

It provides the first large-scale, expert-annotated assessment of AI reviewers' strengths and weaknesses across multiple scientific domains.

Findings

01

AI reviewers outperform top human reviewers on a composite score.

02

AI critics identify issues not raised by humans (26%).

03

AI reviewers show more overlap and specific weaknesses like limited knowledge.

Abstract

With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms…

Code & Models

Repositories

prometheus-eval/cmu-paper-reviewer (GitHub)

Datasets

prometheus-eval/peerreview-bench (dataset, 178 downloads)
