Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance
Giovanni Pinna, Jingzhi Gong, David Williams, Federica Sarro

TL;DR
This empirical study compares five AI coding agents across various task types and over time, revealing that task type significantly influences acceptance rates and that no single agent is best for all tasks.
Contribution
The paper provides a systematic, task-stratified comparison of AI coding agents, highlighting their performance differences and temporal trends in pull request acceptance.
Findings
Devin is the only agent with a consistent positive trend in acceptance rate (+0.77% per week over 32 weeks); the other agents remain largely stable.
Documentation tasks reach an 82.1% acceptance rate versus 66.1% for new features, a 16 percentage point gap.
OpenAI Codex has high acceptance rates across all task categories.
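To make the idea of a task-stratified comparison concrete, here is a minimal sketch of computing acceptance rates per group. The record layout (`agent`, `task`, `accepted`), the helper name `acceptance_by_group`, and the toy records are all assumptions made for illustration; they are not the AIDev schema or data, and this is not the authors' code.

```python
from collections import defaultdict


def acceptance_by_group(prs, key):
    """Return {group: accepted / total} for PR records grouped by `key`."""
    totals = defaultdict(int)
    accepted = defaultdict(int)
    for pr in prs:
        group = pr[key]
        totals[group] += 1
        if pr["accepted"]:
            accepted[group] += 1
    return {group: accepted[group] / totals[group] for group in totals}


# Toy records, invented for illustration only (not drawn from AIDev).
toy_prs = [
    {"agent": "A", "task": "docs", "accepted": True},
    {"agent": "A", "task": "docs", "accepted": True},
    {"agent": "A", "task": "feature", "accepted": False},
    {"agent": "B", "task": "docs", "accepted": True},
    {"agent": "B", "task": "feature", "accepted": True},
    {"agent": "B", "task": "feature", "accepted": False},
]

by_task = acceptance_by_group(toy_prs, "task")    # stratified by task type
by_agent = acceptance_by_group(toy_prs, "agent")  # pooled per agent

# Gap in percentage points between the two rates reported above.
reported_gap_pp = 82.1 - 66.1
```

In the toy data both hypothetical agents have the same pooled acceptance rate while the task types differ sharply, which is the kind of pattern that only shows up once rates are broken down by task.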
Abstract
The rapid adoption of AI-powered coding assistants is transforming software development practices, yet systematic comparisons of their effectiveness across different task types and over time remain limited. This paper presents an empirical study comparing five popular agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code), analyzing 7,156 pull requests (PRs) from the AIDev dataset. Temporal trend analysis reveals heterogeneous evolution patterns: Devin exhibits the only consistent positive trend in acceptance rate (+0.77% per week over 32 weeks), whereas other agents remain largely stable. Our analysis suggests that the PR task type is a dominant factor influencing acceptance rates: documentation tasks achieve 82.1% acceptance compared to 66.1% for new features - a 16 percentage point gap that exceeds typical inter-agent variance for most tasks. OpenAI Codex achieves…
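A per-week trend such as the one reported for Devin can be estimated in several ways; the abstract does not say which method the authors used. The sketch below assumes an ordinary least-squares line fitted to weekly acceptance rates, and it reads "+0.77% per week" as percentage points per week, which is also an assumption. The function name `weekly_trend` and the synthetic series are hypothetical.

```python
def weekly_trend(rates):
    """Least-squares slope of acceptance rate (in %) against week index 0..n-1."""
    n = len(rates)
    if n < 2:
        raise ValueError("need at least two weekly observations")
    mean_x = (n - 1) / 2
    mean_y = sum(rates) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(rates))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx


# Synthetic series, not real data: 32 weeks starting at an arbitrary 50%
# and rising by 0.77 points each week.
synthetic = [50.0 + 0.77 * week for week in range(32)]
slope = weekly_trend(synthetic)

# A flat series should yield a zero slope ("largely stable").
flat_slope = weekly_trend([70.0] * 32)
```

On real weekly rates the fitted slope would be noisy, so a confidence interval or significance test would be needed before calling a trend consistent.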
