CR-Bench: Evaluating the Real-World Utility of AI Code Review Agents
Kristen Pereira, Neelabh Sinha, Rajat Ghosh, Debojyoti Dutta

TL;DR
CR-Bench and CR-Evaluator provide standardized benchmarks and evaluation protocols for AI code review agents, revealing challenges in balancing issue detection and false positives in real-world scenarios.
Contribution
Introduces CR-Bench, a benchmarking dataset, and CR-Evaluator, a fine-grained evaluation pipeline, enabling granular assessment of AI code review agents and a better understanding of their behavior and limitations.
Findings
Code review agents can show a low signal-to-noise ratio when designed to identify all hidden issues.
A trade-off exists between issue resolution and spurious findings.
Evaluation reveals constraints on effective agent design.
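To make the trade-off in these findings concrete, the following is a minimal sketch of how a resolution rate and a signal-to-noise ratio might be computed for a single review. It is not CR-Evaluator's method: the metric definitions, the `ReviewScore` and `score_review` names, and the exact-match comparison of issue labels are all assumptions made for illustration, and the issue labels are invented example data.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewScore:
    resolution_rate: float  # share of known issues the agent reported
    precision: float        # share of agent findings that match a known issue
    signal_to_noise: float  # matched findings per spurious finding


def score_review(known_issues: set[str], findings: set[str]) -> ReviewScore:
    """Score one review. Hypothetical: findings match issues by exact label."""
    hits = len(findings & known_issues)      # findings that are real issues
    spurious = len(findings - known_issues)  # findings with no matching issue
    resolution_rate = hits / len(known_issues) if known_issues else 0.0
    precision = hits / len(findings) if findings else 0.0
    # With no spurious findings the ratio is unbounded; report infinity.
    signal_to_noise = hits / spurious if spurious else float("inf")
    return ReviewScore(resolution_rate, precision, signal_to_noise)


# Invented example: four hidden issues in one pull request.
known = {"null-deref", "race", "off-by-one", "leak"}

# An agent that reports only what it is sure of.
cautious = score_review(known, {"null-deref", "race"})

# An agent that tries to flag everything: one more real issue, six spurious ones.
exhaustive = score_review(
    known,
    {"null-deref", "race", "off-by-one",
     "nit-1", "nit-2", "nit-3", "nit-4", "nit-5", "nit-6"},
)

print(cautious)
print(exhaustive)
```

In this invented example the exhaustive agent has the higher resolution rate (0.75 against 0.5), yet two of every three of its findings are spurious. A ranking based on resolution rate alone would hide that cost.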
Abstract
Recent advances in frontier large language models have enabled code review agents that operate in open-ended, reasoning-intensive settings. However, the lack of standardized benchmarks and granular evaluation protocols makes it difficult to assess the behavior of code review agents beyond coarse success metrics, particularly for tasks where false positives are costly. To address this gap, we introduce CR-Bench, a benchmarking dataset, and CR-Evaluator, a fine-grained evaluation pipeline for code review agents. Using these tools, we conduct a preliminary study evaluating both a single-shot agent and a Reflexion-based agent across two frontier models. We find that code review agents can exhibit a low signal-to-noise ratio when designed to identify all hidden issues, obscuring true progress and developer productivity when measured solely by resolution rates. Our analysis identifies the hidden…
Taxonomy
Topics: Software Engineering Research · Software Engineering Techniques and Practices · Topic Modeling
