SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback
Deepak Kumar

TL;DR
SWE-PRBench is a benchmark of 350 human-annotated pull requests for evaluating AI code review quality. On it, 8 frontier models detect only 15-31% of the issues that human reviewers flagged.
Contribution
This work provides the first comprehensive benchmark for AI code review. It systematically analyzes model performance across three context configurations (diff only, diff with file content, and full context) and reveals a significant gap to human expertise.
Findings
On the diff-only configuration (config_A), 8 frontier models detect only 15-31% of human-flagged issues.
All 8 models degrade monotonically as context grows from diff only (config_A) to full context (config_C).
Structured diff-with-summary prompts outperform full-context prompts.
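The first two findings can be made concrete with a minimal sketch. It assumes the detection rate is the fraction of human-flagged issues that a model also raises, and that "monotonic degradation" means a strictly lower rate at each step from config_A to config_C. The function names, the issue identifiers, and the toy numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, Set

# Configuration names as given in the abstract, ordered from least to most context.
CONFIGS = ("config_A", "config_B", "config_C")


def detection_rate(human_flagged: Set[str], model_matched: Iterable[str]) -> float:
    """Fraction of human-flagged issues that the model also raised (assumed definition)."""
    if not human_flagged:
        raise ValueError("no human-flagged issues to score against")
    hits = human_flagged & set(model_matched)
    return len(hits) / len(human_flagged)


def degrades_monotonically(rate_by_config: Dict[str, float]) -> bool:
    """True if the rate strictly drops at every step from config_A to config_C."""
    rates = [rate_by_config[name] for name in CONFIGS]
    return all(earlier > later for earlier, later in zip(rates, rates[1:]))


# Toy data only: four hypothetical human-flagged issues, one of which the model finds.
toy_human = {"i1", "i2", "i3", "i4"}
toy_rate = detection_rate(toy_human, ["i2", "x9"])  # "x9" matches no human issue

# Illustrative rates, not results from the paper.
toy_rates = {"config_A": 0.30, "config_B": 0.20, "config_C": 0.10}
toy_degrades = degrades_monotonically(toy_rates)
```

Under this definition, model comments that match no human-flagged issue (such as "x9" above) do not lower the score; a precision-style metric would be needed to penalize them.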
Abstract
We introduce SWE-PRBench, a benchmark of 350 pull requests with human-annotated ground truth for evaluating AI code review quality. Evaluated against an LLM-as-judge framework validated at kappa=0.75, 8 frontier models detect only 15-31% of human-flagged issues on the diff-only configuration, demonstrating that AI code review remains far below human expert performance despite strong results on code generation benchmarks. Pull requests are drawn from active open-source repositories, filtered from 700 candidates using a Repository Quality Score, and evaluated under three frozen context configurations: diff only (config_A), diff with file content (config_B), and full context (config_C), enabling systematic ablation of context provision strategies. All 8 models degrade monotonically from config_A to config_C, even when context is provided via structured semantic layers including…
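The abstract reports that the LLM-as-judge framework was validated at kappa=0.75 but does not say here which kappa statistic was used. As a sketch, assuming it is Cohen's kappa between the judge's labels and human labels on the same items, the agreement could be computed as follows. The function name and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items on which both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement if each rater labeled independently at their own label rates.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single identical label")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy labels only (1 = "model comment matches a human issue", 0 = "no match").
toy_human_labels = [1, 1, 0, 0, 1, 0, 1, 0]
toy_judge_labels = [1, 1, 0, 0, 1, 0, 0, 1]
toy_kappa = cohens_kappa(toy_human_labels, toy_judge_labels)
```

In the toy example the raters agree on 6 of 8 items (0.75 observed) while chance agreement is 0.5, so kappa is 0.5; raw agreement and kappa are different quantities, which is why a kappa of 0.75 indicates substantially stronger agreement than 75% raw overlap would.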
