Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review
Sungduk Yu, Man Luo, Avinash Madasu, Vasudev Lal, Phillip Howard

TL;DR
This paper introduces a large dataset of human-written and AI-generated peer reviews for benchmarking AI text detection methods, and shows that current detectors struggle to identify AI-written reviews.
Contribution
The paper provides a comprehensive dataset of nearly 789,000 peer reviews and evaluates 18 detection algorithms, highlighting the difficulty of detecting AI-generated peer reviews and proposing a context-aware detection method.
Findings
Detection models struggle to reliably identify AI-generated reviews.
AI text detection accuracy decreases with LLM-assisted editing.
The dataset enables benchmarking and future research in AI review detection.
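One way to make "reliably identify" concrete is the true-positive rate at a fixed low false-positive rate, the constraint the peer reviews further down this page emphasize. The sketch below is illustrative only and is not the paper's evaluation code: the function names and the toy scores are assumptions, and higher scores are assumed to mean "more likely AI-written".

```python
import math


def threshold_at_fpr(human_scores, target_fpr):
    """Return a threshold such that at most `target_fpr` of the
    human-written reviews score strictly above it."""
    ranked = sorted(human_scores, reverse=True)
    # Number of human reviews we are allowed to flag by mistake.
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")  # every review may be flagged
    return ranked[allowed]


def tpr_at_fpr(human_scores, ai_scores, target_fpr):
    """Fraction of AI-written reviews flagged when the threshold is
    calibrated on human reviews to respect `target_fpr`."""
    threshold = threshold_at_fpr(human_scores, target_fpr)
    detected = sum(1 for score in ai_scores if score > threshold)
    return detected / len(ai_scores)


# Toy detector scores (hypothetical, not taken from the paper).
human_demo = [0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.90]
ai_demo = [0.30, 0.60, 0.70, 0.95]

for fpr in (0.1, 0.0):
    print(f"target FPR {fpr}: TPR = {tpr_at_fpr(human_demo, ai_demo, fpr)}")
```

Calibrating the threshold on human-written reviews alone reflects the practical concern that wrongly accusing a human reviewer is costly. In this toy data, tightening the allowed false-positive rate from 0.1 to 0.0 drops the detection rate from 0.75 to 0.25.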
Abstract
Peer review is a critical process for ensuring the integrity of published scientific research. Confidence in this process is predicated on the assumption that experts in the relevant domain give careful consideration to the merits of manuscripts which are submitted for publication. With the recent rapid advancements in large language models (LLMs), a new risk to the peer review process is that negligent reviewers will rely on LLMs to perform the often time-consuming process of reviewing a paper. However, there is a lack of existing resources for benchmarking the detectability of AI text in the domain of peer review. To address this deficiency, we introduce a comprehensive dataset containing a total of 788,984 AI-written peer reviews paired with corresponding human reviews, covering 8 years of papers submitted to each of two leading AI research conferences (ICLR and NeurIPS). We use this…
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper shows a certain level of originality. It creates a large-scale dataset in a new domain, peer review. Existing detection datasets (e.g., HC3, RAID-TD, M4, Beemo, GRiD) do not cover this domain. The benchmarking framework’s focus on achieving reliable detection under low false-positive constraints is also a meaningful and novel contribution, as few prior works explicitly address this important practical and ethical consideration in review integrity contexts. 2. The paper is cl
1. The proposed Anchor method offers only incremental innovation. Its core idea, measuring embedding similarity between a suspected review and an LLM-generated reference, is conceptually similar to prior work such as DetectLLM and DNA-GPT, which also compare generated and reference texts to assess authorship likelihood. 2. The dataset relies entirely on synthetic AI reviews, which may not fully represent real-world LLM-assisted peer-review behavior. Also, if the same or related model family
1. The paper addresses a critical and emerging problem in scientific peer review: the detectability of LLM-generated reviews. 2. The constructed dataset is large-scale and can be useful for ongoing studies of AI text detection. The authors benchmark 18 existing AI text detection methods, showing that most of them struggle significantly at low false positive rates. 3. I like how the authors study the differences between human and AI-written reviews. This could be very helpfu
1. One potential explanation for the poor performance is that peer review text may be out-of-distribution for these detection models. Have the authors considered evaluating whether fine-tuning existing detection methods on peer review data could substantially improve detectability? 2. Although the authors tested the robustness of their methods across different prompts and pipelines, they all follow the same paradigm: given a paper, generate reviews on it. However, in reality, this may not be ho
1. This paper targets an ethical issue: the detection of LLM-generated peer reviews. Existing benchmarks mainly focus on general AI text detection, while this paper provides a new dataset specifically for peer reviews, offering a useful benchmark for this problem. 2. In the evaluation part, the paper considers various aspects to analyze the problem, among which different prompts and LLM-assisted editing are meaningful. Furthermore, the authors also discuss different types and parts of peer reviews, such as
1. Word distributions may differ because papers from different domains carry domain-specific biases. The paper lacks analysis of such diversity, which tends to amplify the differences between LLM-generated texts and human-written reviews. Whether using human-written reviews from a specific domain would improve the quality of AI-written reviews for similar papers is not discussed. 2. Although "AI peer review detection robust to prompt variations" is described in Section 5.2, four r
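One of the reviews above describes the core idea of the Anchor method as measuring embedding similarity between a suspected review and an LLM-generated reference review. The sketch below illustrates only that general idea and is not the authors' implementation: a bag-of-words vector stands in for a real text-embedding model, and `embed`, `cosine`, and `anchor_score` are hypothetical names.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def anchor_score(review, reference_review):
    """Similarity of a suspected review to an LLM-generated reference
    review of the same paper; higher suggests more AI-like text."""
    return cosine(embed(review), embed(reference_review))


# Hypothetical texts, written for illustration only.
reference = ("The paper presents a novel approach and "
             "the experiments are comprehensive")
suspect = "The paper presents a novel approach with comprehensive experiments"
human = "I could not follow section three, where is the ablation?"

print("suspect:", round(anchor_score(suspect, reference), 3))
print("human:  ", round(anchor_score(human, reference), 3))
```

In practice the score would be compared against a threshold calibrated on known human reviews. The choice of embedding model and reference-generation prompt is left open here because the excerpts on this page do not specify them.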
