TL;DR
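CoCoReviewBench is a benchmark for evaluating AI reviewers on completeness and correctness. It curates 3,900 papers from ICLR and NeurIPS and uses reviewer–author–meta-review discussions as expert annotations, addressing the tendency of existing metrics to reward overlap with human reviews rather than correctness.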
Contribution
The paper introduces CoCoReviewBench, a benchmark of 3,900 curated ICLR and NeurIPS papers with category-specific subsets and filtering of unreliable human reviews, intended to make evaluation of AI reviewers more reliable and fine-grained.
Findings
AI reviewers remain limited in correctness and are prone to hallucinations.
Reasoning models outperform other AI reviewer types.
The benchmark enables more reliable and fine-grained evaluation of AI reviewers.
Abstract
Despite the rapid development of AI reviewers, evaluating such systems remains challenging: metrics favor overlap with human reviews over correctness. However, since human reviews often cover only a subset of salient issues and sometimes contain mistakes, they are unreliable as gold references. To address this, we build category-specific benchmark subsets and skip evaluation when the corresponding human reviews are missing to strengthen Completeness. We also leverage reviewer–author–meta-review discussions as expert annotations and filter unreliable reviews accordingly to strengthen Correctness. Finally, we introduce CoCoReviewBench, which curates 3,900 papers from ICLR and NeurIPS to enable reliable and fine-grained evaluation of AI reviewers. Analysis shows that AI reviewers remain limited in correctness and are prone to hallucinations, and highlights reasoning models as more…
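The "skip evaluation when the corresponding human reviews are missing" idea from the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the paper's implementation: the `PaperRecord` structure, the category names, the `exact_match` matcher, and the simple coverage ratio are hypothetical stand-ins for whatever data format and metrics CoCoReviewBench actually uses. The sketch also assumes that unreliable human review points have already been filtered out upstream.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class PaperRecord:
    """Hypothetical record: review points grouped by issue category."""
    paper_id: str
    # category -> human review points (assumed already filtered for reliability)
    human_points: Dict[str, List[str]] = field(default_factory=dict)
    # category -> points raised by the AI reviewer
    ai_points: Dict[str, List[str]] = field(default_factory=dict)


def exact_match(human_point: str, ai_point: str) -> bool:
    """Toy matcher (assumption): case-insensitive string equality."""
    return human_point.strip().lower() == ai_point.strip().lower()


def category_subset(papers: List[PaperRecord], category: str) -> List[PaperRecord]:
    """Keep only papers that have at least one human point in this category."""
    return [p for p in papers if p.human_points.get(category)]


def coverage(paper: PaperRecord, category: str,
             match: Callable[[str, str], bool]) -> float:
    """Fraction of human points in the category matched by some AI point."""
    human = paper.human_points[category]
    ai = paper.ai_points.get(category, [])
    hits = sum(1 for h in human if any(match(h, a) for a in ai))
    return hits / len(human)


def evaluate(papers: List[PaperRecord], categories: List[str],
             match: Callable[[str, str], bool] = exact_match
             ) -> Dict[str, Optional[float]]:
    """Average per-category coverage; categories with no human reference are skipped."""
    scores: Dict[str, Optional[float]] = {}
    for category in categories:
        subset = category_subset(papers, category)
        if not subset:
            # No human reference for this category: report None, not zero.
            scores[category] = None
            continue
        scores[category] = sum(coverage(p, category, match) for p in subset) / len(subset)
    return scores
```

The design point is that a category with no human reference yields `None` and is left out of the average, so an AI reviewer is not penalized for raising issues in a category that the human reviewers never addressed.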
