Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews

Bowen Li; Haochen Ma; Yuxin Wang; Jie Yang; Yining Zheng; Xinchi Chen; Xuanjing Huang; Xipeng Qiu

arXiv:2604.19502·cs.CL·April 23, 2026

TL;DR

This paper introduces Beyond Rating, a comprehensive evaluation framework for AI peer review that emphasizes textual justification over scalar scores, and proposes new metrics and datasets to improve alignment with human judgment.

Contribution

It presents a holistic evaluation framework and novel metrics for AI reviews, emphasizing textual content and argumentative quality, moving beyond traditional rating prediction benchmarks.

Findings

01

Text-centric metrics, especially recall of weakness arguments, correlate strongly with human preferences.

02

Traditional n-gram metrics fail to reflect human evaluation of reviews.

03

Aligning AI critique focus with human experts improves automated scoring reliability.

Abstract

The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification--its arguments, questions, and critique--rather than a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulness, Argumentative Alignment, Focus Consistency, Question Constructiveness, and AI-Likelihood. Notably, we propose a Max-Recall strategy to accommodate valid expert disagreement and introduce a curated dataset of papers with high-confidence reviews, rigorously filtered to remove procedural noise. Extensive experiments demonstrate that while traditional n-gram metrics fail to reflect human…
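The abstract names a Max-Recall strategy but does not define it here. One plausible reading, offered only as an illustrative sketch, is to score an AI review's arguments against each human review separately and keep the best recall, so the AI reviewer is not penalized when experts disagree with each other. Everything below is an assumption for illustration: the function names (`token_overlap`, `recall`, `max_recall`), the word-overlap matcher, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets.

    Hypothetical stand-in for a real argument matcher (the paper's
    matching procedure is not described in this abstract).
    """
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def recall(
    ai_args: Sequence[str],
    human_args: Sequence[str],
    match: Callable[[str, str], float] = token_overlap,
    threshold: float = 0.5,  # assumed cutoff, chosen only for the example
) -> float:
    """Fraction of one human reviewer's arguments covered by the AI review."""
    if not human_args:
        return 0.0
    covered = sum(
        1 for h in human_args
        if any(match(h, a) >= threshold for a in ai_args)
    )
    return covered / len(human_args)


def max_recall(
    ai_args: Sequence[str],
    human_reviews: Sequence[Sequence[str]],
    match: Callable[[str, str], float] = token_overlap,
    threshold: float = 0.5,
) -> float:
    """Best recall over the individual human reviews (0.0 if there are none)."""
    return max(
        (recall(ai_args, h, match, threshold) for h in human_reviews),
        default=0.0,
    )


if __name__ == "__main__":
    ai_weaknesses = [
        "the baselines are outdated",
        "no ablation study is provided",
    ]
    human_weaknesses: List[List[str]] = [
        ["the baselines are outdated", "the writing is unclear"],
        ["no ablation study is provided"],
    ]
    print(max_recall(ai_weaknesses, human_weaknesses))
```

Taking the maximum instead of the mean is the design choice this sketch is meant to highlight: an AI review that fully matches one expert's critique scores well even if a second expert raised different, equally valid points.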
