What Makes a Good AI Review? Concern-Level Diagnostics for AI Peer Review
Ming Jin

TL;DR
This paper introduces a concern-level diagnostic framework for AI peer reviews. It supports detailed evaluation of which concerns an AI system identifies, how it prioritizes them, and whether those priorities align with the review rationale.
Contribution
It proposes a reusable evaluation framework using match graphs and an evaluation ladder to audit concern detection, calibration, and decision-making in AI reviews.
Findings
Detection of concerns is common, but calibration is often the main constraint.
Most systems mark a high percentage of concerns as decisive, yet few treat concerns as true blockers.
Different inference methods can lead to similar verdicts but hide underlying behavior differences.
Abstract
Evaluating AI-generated reviews by verdict agreement is widely recognized as insufficient, yet current alternatives rarely audit which concerns a system identifies, how it prioritizes them, or whether those priorities align with the review rationale that shaped the final assessment. We propose concern alignment, a diagnostic framework that evaluates AI reviews at the concern level rather than only at the verdict level. The framework's core data structure is the match graph, a bipartite alignment between official and AI-generated concerns annotated with match type, severity, and post-rebuttal treatment. From this artifact we derive an evaluation ladder that moves from binary accuracy to concern detection, verdict-stratified behavior, decision-aware calibration, and rebuttal-aware decomposition. In a pilot study of four public AI review systems evaluated in six configurations,…
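To make the match graph concrete, the following is a minimal Python sketch of a bipartite alignment between official and AI-generated concerns, with two simple concern-level statistics derived from it. It is an illustration under stated assumptions, not the paper's implementation: the class names, field names, label values ("full", "partial", "decisive", "resolved", and so on), and the three metric definitions are all hypothetical and were chosen only to mirror the annotations the abstract names (match type, severity, post-rebuttal treatment).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Concern:
    cid: str
    text: str
    severity: str  # hypothetical labels: "minor", "major", "decisive"


@dataclass(frozen=True)
class Match:
    official_id: str
    ai_id: str
    match_type: str  # hypothetical labels: "full", "partial"
    post_rebuttal: Optional[str] = None  # hypothetical: "resolved", "unresolved"


@dataclass
class MatchGraph:
    """Bipartite alignment: official concerns on one side, AI concerns on the other."""
    official: list[Concern]
    ai: list[Concern]
    edges: list[Match] = field(default_factory=list)

    def detection_recall(self) -> float:
        # Share of official concerns matched by at least one AI concern.
        matched = {e.official_id for e in self.edges}
        return len(matched) / len(self.official) if self.official else 0.0

    def ai_precision(self) -> float:
        # Share of AI concerns that align with at least one official concern.
        matched = {e.ai_id for e in self.edges}
        return len(matched) / len(self.ai) if self.ai else 0.0

    def decisive_rate(self) -> float:
        # Share of AI concerns that the AI review itself labels as decisive.
        decisive = [c for c in self.ai if c.severity == "decisive"]
        return len(decisive) / len(self.ai) if self.ai else 0.0


# Toy data, invented purely for illustration.
graph = MatchGraph(
    official=[
        Concern("o1", "Baselines are missing", "major"),
        Concern("o2", "Claims exceed the evidence", "decisive"),
        Concern("o3", "Notation is inconsistent", "minor"),
    ],
    ai=[
        Concern("a1", "No comparison to strong baselines", "decisive"),
        Concern("a2", "Overstated conclusions", "decisive"),
        Concern("a3", "Figure labels are small", "minor"),
        Concern("a4", "Related work is thin", "major"),
    ],
    edges=[
        Match("o1", "a1", "full", post_rebuttal="resolved"),
        Match("o2", "a2", "partial", post_rebuttal="unresolved"),
    ],
)

print(f"detection recall: {graph.detection_recall():.2f}")
print(f"AI precision:     {graph.ai_precision():.2f}")
print(f"decisive rate:    {graph.decisive_rate():.2f}")
```

In this toy case the AI review detects two of the three official concerns but over-escalates one of them: it labels "a1" decisive while the matched official concern "o1" is only major. A gap of that kind is invisible to verdict agreement alone and is what a concern-level audit is meant to surface.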
