From Industry Claims to Empirical Reality: An Empirical Study of Code Review Agents in Pull Requests
Kowshik Chowdhury, Dipayan Banik, K M Ferdous, Shazibul Islam Shamim

TL;DR
This study empirically evaluates autonomous code review agents (CRAs) in pull requests (PRs) and finds that they often produce low-signal feedback, which is associated with higher abandonment rates than human review.
Contribution
The paper provides the first empirical analysis of CRA review quality and its impact on PR outcomes, highlighting the importance of human oversight in automated code reviews.
Findings
CRA-only PRs have a 45.20% merge rate, lower than human-only PRs at 68.37%.
Most CRA-only PRs exhibit low signal-to-noise ratios, indicating noisy feedback.
High abandonment rates are associated with low-signal CRA feedback.
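The comparisons above rest on two simple quantities: a merge rate per reviewer-composition group and a per-PR measure of how much CRA feedback is signal rather than noise. The sketch below shows one way such figures could be computed. It is not the authors' pipeline: the record fields (`reviewers`, `merged`, `signal`, `noise`), the toy data, and the definition of the signal ratio as signal comments divided by all comments are assumptions made for illustration, since this summary does not state how the paper defines its signal-to-noise ratio.

```python
from collections import defaultdict

# Toy PR records. Field names and values are hypothetical, not taken from
# the paper's dataset.
prs = [
    {"reviewers": "cra_only", "merged": True, "signal": 1, "noise": 3},
    {"reviewers": "cra_only", "merged": False, "signal": 0, "noise": 4},
    {"reviewers": "cra_only", "merged": True, "signal": 2, "noise": 2},
    {"reviewers": "cra_only", "merged": False, "signal": 1, "noise": 5},
    {"reviewers": "human_only", "merged": True, "signal": 3, "noise": 1},
    {"reviewers": "human_only", "merged": True, "signal": 2, "noise": 0},
    {"reviewers": "human_only", "merged": False, "signal": 1, "noise": 1},
    {"reviewers": "human_only", "merged": True, "signal": 4, "noise": 1},
]


def merge_rate_by_group(records):
    """Return the fraction of merged PRs for each reviewer-composition group."""
    totals = defaultdict(int)
    merged = defaultdict(int)
    for pr in records:
        totals[pr["reviewers"]] += 1
        merged[pr["reviewers"]] += int(pr["merged"])
    return {group: merged[group] / totals[group] for group in totals}


def signal_ratio(pr):
    """Assumed definition: signal comments divided by all review comments."""
    total = pr["signal"] + pr["noise"]
    return pr["signal"] / total if total else 0.0


rates = merge_rate_by_group(prs)
low_signal = [pr for pr in prs if signal_ratio(pr) < 0.5]

# Gap between the two merge rates reported in the findings, in percentage points.
reported_gap = round(68.37 - 45.20, 2)

print(rates)
print(len(low_signal), "low-signal PRs in the toy data")
print(reported_gap, "percentage points")
```

Applied to the reported figures, the difference between the human-only and CRA-only merge rates is 68.37 - 45.20 = 23.17 percentage points. The 0.5 threshold for "low signal" is likewise an arbitrary choice for this sketch.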
Abstract
Autonomous coding agents are generating code at an unprecedented scale, with OpenAI Codex alone creating over 400,000 pull requests (PRs) in two months. As agentic PR volumes increase, code review agents (CRAs) have become routine gatekeepers in development workflows. Industry reports claim that CRAs can manage 80% of PRs in open source repositories without human involvement. As a result, understanding the effectiveness of CRA reviews is crucial for maintaining developmental workflows and preventing wasted effort on abandoned pull requests. However, empirical evidence on how CRA feedback quality affects PR outcomes remains limited. The goal of this paper is to help researchers and practitioners understand when and how CRAs influence PR merge success by empirically analyzing reviewer composition and the signal quality of CRA-generated comments. From AIDev's 19,450 PRs, we analyze 3,109…
