CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers

Hexuan Deng; Xiaopeng Ke; Yichen Li; Ruina Hu; Dehao Huang; Derek F. Wong; Yue Wang; Xuebo Liu; Min Zhang

arXiv:2605.07905 · cs.CL · May 19, 2026

TL;DR

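CoCoReviewBench is a benchmark for evaluating the completeness and correctness of AI reviewers. It curates 3,900 papers from ICLR and NeurIPS and uses reviewer–author–meta-review discussions as expert annotations, addressing the tendency of existing metrics to reward overlap with human reviews rather than correctness.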

Contribution

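The paper introduces CoCoReviewBench, a benchmark that pairs category-specific subsets with two evaluation strategies: skipping evaluation in categories where human reviews are missing (for Completeness), and filtering unreliable human reviews using the reviewer–author–meta-review discussions (for Correctness).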

Findings

01 AI reviewers remain limited in correctness and are prone to hallucinations.

02 Reasoning models outperform other AI reviewer types.

03 The benchmark enables more reliable and fine-grained evaluation of AI reviewers.

Abstract

Despite the rapid development of AI reviewers, evaluating such systems remains challenging: metrics favor overlap with human reviews over correctness. However, since human reviews often cover only a subset of salient issues and sometimes contain mistakes, they are unreliable as gold references. To address this, we build category-specific benchmark subsets and skip evaluation when the corresponding human reviews are missing to strengthen Completeness. We also leverage reviewer–author–meta-review discussions as expert annotations and filter unreliable reviews accordingly to strengthen Correctness. Finally, we introduce CoCoReviewBench, which curates 3,900 papers from ICLR and NeurIPS to enable reliable and fine-grained evaluation of AI reviewers. Analysis shows that AI reviewers remain limited in correctness and are prone to hallucinations, and highlights reasoning models as more…
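
The two strategies named in the abstract (skipping categories that lack human reviews, and filtering out unreliable human reviews) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `HumanPoint` record, its `reliable` flag, the category labels, the `token_overlap_match` matcher, and the per-category recall score are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class HumanPoint:
    """One issue raised in a human review (hypothetical schema)."""
    category: str          # assumed label, e.g. "experiments" or "novelty"
    text: str
    reliable: bool = True  # assumed flag derived from the discussion thread


def token_overlap_match(reference, candidate, threshold=0.5):
    """Placeholder matcher: Jaccard overlap of lowercase tokens."""
    ref_tokens = set(reference.lower().split())
    cand_tokens = set(candidate.lower().split())
    if not ref_tokens or not cand_tokens:
        return False
    overlap = len(ref_tokens & cand_tokens) / len(ref_tokens | cand_tokens)
    return overlap >= threshold


def build_references(human_points):
    """Group human points by category, dropping those marked unreliable."""
    references = {}
    for point in human_points:
        if point.reliable:
            references.setdefault(point.category, []).append(point.text)
    return references


def evaluate_by_category(ai_points, human_points, match=token_overlap_match):
    """Return per-category recall of the AI review against reliable human points.

    ai_points maps a category to the list of issues the AI reviewer raised.
    Categories with no reliable human reference are skipped, not scored as zero.
    """
    references = build_references(human_points)
    scores = {}
    for category, ref_texts in references.items():
        candidates = ai_points.get(category, [])
        hits = sum(any(match(ref, cand) for cand in candidates) for ref in ref_texts)
        scores[category] = hits / len(ref_texts)
    return scores
```

In this sketch, a category that appears only in the AI review, or whose human points were all filtered out, simply does not appear in the returned scores, so the AI reviewer is neither rewarded nor penalized where no trustworthy reference exists.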

Code & Models

Repositories

hexuandeng/CoCoReviewBench (GitHub)