TL;DR
PeerPrism is a large-scale benchmark for evaluating how well LLM detection methods distinguish human from AI contributions in peer reviews, with an emphasis on hybrid human-AI collaboration.
Contribution
This work presents the first benchmark explicitly designed to disentangle idea provenance from text provenance in peer reviews, highlighting limitations of current detection methods.
Findings
Detection methods perform well on binary human vs. AI tasks.
Detectors often disagree when ideas are human but text is AI-generated.
Current methods conflate surface style with intellectual contribution.
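To make the two-axis framing concrete, here is a minimal sketch of how reviews could be labeled by idea source and text source separately, and how detector disagreement could be compared across those regimes. This is not the paper's code or metric: the `Review` class, the `"human"`/`"ai"` label strings, and the pairwise disagreement measure are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Review:
    """Hypothetical record with the two provenance axes kept separate."""
    idea_source: str  # "human" or "ai": who supplied the evaluative reasoning
    text_source: str  # "human" or "ai": who produced the surface wording


def disagreement_rate(verdict_lists):
    """Fraction of detector pairs whose verdicts differ, pooled over reviews.

    verdict_lists holds one list per review, with one verdict per detector.
    This pairwise measure is an illustrative choice, not the paper's metric.
    """
    total = 0
    differing = 0
    for verdicts in verdict_lists:
        for a, b in combinations(verdicts, 2):
            total += 1
            if a != b:
                differing += 1
    return differing / total if total else 0.0


def disagreement_by_regime(reviews, verdict_lists):
    """Group reviews by (idea_source, text_source) and score each group."""
    groups = {}
    for review, verdicts in zip(reviews, verdict_lists):
        key = (review.idea_source, review.text_source)
        groups.setdefault(key, []).append(verdicts)
    return {key: disagreement_rate(v) for key, v in groups.items()}
```

Calling `disagreement_by_regime` with a list of `Review` objects and the matching detector verdicts returns one rate per regime, so the hybrid case (human ideas, AI text) can be read off next to the two pure cases.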
Abstract
Large Language Models (LLMs) are increasingly used in scientific peer review, assisting with drafting, rewriting, expansion, and refinement. However, existing peer-review LLM detection methods largely treat authorship as a binary problem (human vs. AI) without accounting for the hybrid nature of modern review workflows. In practice, evaluative ideas and surface realization may originate from different sources, creating a spectrum of human-AI collaboration. In this work, we introduce PeerPrism, a large-scale benchmark of 20,690 peer reviews explicitly designed to disentangle idea provenance from text provenance. We construct controlled generation regimes spanning fully human, fully synthetic, and multiple hybrid transformations. This design enables systematic evaluation of whether detectors identify the origin of the surface text or the origin of the evaluative reasoning. We benchmark…
