AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context

Lei Zhang; Yongda Yu; Minghui Yu; Xinxin Guo; Zhengqi Zhuang; Guoping Rong; Dong Shao; Haifeng Shen; Hongyu Kuang; Zhengfeng Li; Boge Wang; Guoan Zhang; Bangyu Xiang; Xiaobin Xu

arXiv:2601.19494·cs.SE·February 2, 2026

Open Access · 1 dataset

TL;DR

AACR-Bench is a new, comprehensive benchmark for evaluating Large Language Models in Automated Code Review, supporting multiple languages and using expert-verified annotations to improve defect detection and assessment accuracy.

Contribution

Introduces AACR-Bench, a multi-language, repository-level benchmark with an AI-assisted, expert-verified annotation pipeline, enhancing defect coverage and evaluation rigor for LLM-based ACR.

Findings

01. Previous benchmarks had limited language support and noisy ground truth.

02. Model performance varies significantly with context granularity and retrieval methods.

03. Evaluation reveals that prior assessments may have misjudged LLM capabilities.

Abstract

High-quality evaluation benchmarks are pivotal for deploying Large Language Models (LLMs) in Automated Code Review (ACR). However, existing benchmarks suffer from two critical limitations: first, the lack of multi-language support in repository-level contexts, which restricts the generalizability of evaluation results; second, the reliance on noisy, incomplete ground truth derived from raw Pull Request (PR) comments, which constrains the scope of issue detection. To address these challenges, we introduce AACR-Bench, a comprehensive benchmark that provides full cross-file context across multiple programming languages. Unlike traditional datasets, AACR-Bench employs an "AI-assisted, Expert-verified" annotation pipeline to uncover latent defects often overlooked in original PRs, resulting in a 285% increase in defect coverage. Extensive evaluations of mainstream LLMs on AACR-Bench reveal…
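
As a reading aid for the "285% increase in defect coverage" figure: under the usual meaning of a percentage increase, the annotated set would hold 3.85 times as many defects as the original PR comments captured, not 2.85 times. The sketch below only illustrates that arithmetic. The function name and the baseline of 100 defects are hypothetical and are not taken from the paper, which may define coverage differently.

```python
def value_after_increase(baseline: float, percent_increase: float) -> float:
    """Return the value reached after growing `baseline` by `percent_increase` percent."""
    # An increase of p percent multiplies the baseline by (100 + p) / 100.
    return baseline * (100 + percent_increase) / 100


# Hypothetical baseline: 100 defects found in raw PR comments (not a figure from the paper).
hypothetical_baseline = 100
after = value_after_increase(hypothetical_baseline, 285)
print(after)                          # 385.0
print(after / hypothetical_baseline)  # 3.85, the multiplier implied by a 285% increase
```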

Code & Models

Datasets

Alibaba-Aone/aacr-bench
dataset · 20 dl

Taxonomy

Topics: Software Engineering Research · Natural Language Processing Techniques · Topic Modeling