SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback

Deepak Kumar

arXiv:2603.26130 · cs.SE · March 30, 2026

TL;DR

SWE-PRBench is a benchmark of 350 pull requests with human-annotated ground truth for evaluating AI code review quality. On the diff-only configuration, 8 frontier models detect only 15-31% of the issues that human reviewers flagged.

Contribution

This work provides the first comprehensive benchmark for AI code review, systematically comparing model performance across three frozen context configurations (diff only, diff with file content, and full context) and revealing a significant gap to human expert review.

Findings

01

The 8 frontier models evaluated detect only 15-31% of human-flagged issues on the diff-only configuration.

02

All 8 models degrade monotonically as the provided context grows from diff only (config_A) to full context (config_C).

03

Structured diff-with-summary prompts outperform full-context prompts.

Abstract

We introduce SWE-PRBench, a benchmark of 350 pull requests with human-annotated ground truth for evaluating AI code review quality. Evaluated against an LLM-as-judge framework validated at kappa=0.75, 8 frontier models detect only 15-31% of human-flagged issues on the diff-only configuration, demonstrating that AI code review remains far below human expert performance despite strong results on code generation benchmarks. Pull requests are drawn from active open-source repositories, filtered from 700 candidates using a Repository Quality Score, and evaluated under three frozen context configurations: diff only (config_A), diff with file content (config_B), and full context (config_C), enabling systematic ablation of context provision strategies. All 8 models degrade monotonically from config_A to config_C, even when context is provided via structured semantic layers including…
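The two headline numbers in the abstract, the judge agreement (kappa) and the share of human-flagged issues a model detects, can be illustrated with a minimal sketch. It makes several assumptions that this excerpt does not confirm: that "kappa" refers to Cohen's kappa between two sets of labels, and that detection is counted as the fraction of human-flagged issue IDs a model's review was matched to. The function names, labels, and issue IDs are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences (assumed agreement metric)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def detection_rate(human_issue_ids, detected_issue_ids):
    """Fraction of human-flagged issues that also appear among the model's matched issues."""
    human = set(human_issue_ids)
    if not human:
        raise ValueError("no human-flagged issues to score against")
    return len(human & set(detected_issue_ids)) / len(human)


# Hypothetical judge-validation labels (1 = "issue detected", 0 = "missed").
human_labels = [1, 1, 1, 1, 0, 0, 0, 0]
judge_labels = [1, 1, 1, 0, 0, 0, 0, 1]
kappa = cohens_kappa(human_labels, judge_labels)

# Hypothetical issue IDs for a single pull request.
rate = detection_rate({"i1", "i2", "i3", "i4"}, {"i2", "x9"})
```

In this toy data the two raters agree on 6 of 8 items with balanced labels, so kappa is 0.5, and the model matches 1 of 4 human-flagged issues, so the detection rate is 0.25. The benchmark's actual matching procedure and aggregation across pull requests may differ from this sketch.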
