Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews
Bowen Li, Haochen Ma, Yuxin Wang, Jie Yang, Yining Zheng, Xinchi Chen, Xuanjing Huang, Xipeng Qiu

TL;DR
This paper introduces Beyond Rating, a comprehensive evaluation framework for AI peer review that emphasizes textual justification over scalar scores, and proposes new metrics and datasets to improve alignment with human judgment.
Contribution
It presents a holistic evaluation framework and novel metrics for AI reviews, emphasizing textual content and argumentative quality, moving beyond traditional rating prediction benchmarks.
Findings
Text-centric metrics, especially recall of weakness arguments, correlate strongly with human preferences.
Traditional n-gram metrics fail to reflect human evaluation of reviews.
Aligning AI critique focus with human experts improves automated scoring reliability.
Abstract
The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification--its arguments, questions, and critique--rather than a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulness, Argumentative Alignment, Focus Consistency, Question Constructiveness, and AI-Likelihood. Notably, we propose a Max-Recall strategy to accommodate valid expert disagreement and introduce a curated dataset of papers with high-confidence reviews, rigorously filtered to remove procedural noise. Extensive experiments demonstrate that while traditional n-gram metrics fail to reflect human…
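The abstract names a Max-Recall strategy but the excerpt here does not define it. One plausible reading, offered only as a minimal sketch: score an AI review's weakness arguments by recall against each human review separately, then keep the maximum, so that an AI review agreeing with any one expert is not penalized for missing points raised only by others. The function names, the list-of-strings representation of arguments, and the token-overlap matcher below are all assumptions for illustration, not the paper's implementation.

```python
from typing import Callable, Sequence

Matcher = Callable[[str, str], bool]


def token_overlap_match(a: str, b: str, threshold: float = 0.5) -> bool:
    """Hypothetical matcher: Jaccard overlap of lowercase tokens.

    A real system would more likely use an LLM judge or embeddings;
    this stand-in only keeps the sketch self-contained.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) >= threshold


def recall(reference: Sequence[str], candidate: Sequence[str],
           matches: Matcher) -> float:
    """Fraction of reference arguments matched by at least one candidate."""
    if not reference:
        return 0.0
    covered = sum(
        1 for ref in reference if any(matches(ref, c) for c in candidate)
    )
    return covered / len(reference)


def max_recall(human_reviews: Sequence[Sequence[str]],
               ai_arguments: Sequence[str],
               matches: Matcher = token_overlap_match) -> float:
    """Assumed Max-Recall: best recall against any single human review."""
    return max(
        (recall(review, ai_arguments, matches) for review in human_reviews),
        default=0.0,
    )
```

Under this reading, two experts who raise disjoint weaknesses do not force the AI review to cover both; averaging recall across reviewers would instead cap the score at 0.5 for a review that fully matches one of them.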
