A Task-Level Evaluation of AI Agents in Open-Source Projects
Shojibur Rahman, Md Fazle Rabbi, Minhaz Zibran

TL;DR
This study compares five AI coding agents in open-source projects, evaluating their effectiveness across PR acceptance, review discussions, and commit quality to inform better integration in software development.
Contribution
It provides a task-aware evaluation of five autonomous coding agents on AIDev-pop, a large public dataset of AI-generated pull requests, highlighting each agent's strengths and weaknesses on real-world coding tasks.
Findings
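Codex consistently achieves high PR acceptance rates across most task categories.
Copilot's PRs trigger the highest volume of review discussion, both human and automated.
Claude and Cursor produce higher proportions of high-quality commit messages across several task types, while Codex's commit quality is comparatively lower despite its strong acceptance outcomes.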
Abstract
In this paper, we present a comparative study of five autonomous coding agents using AIDev-pop, a public dataset containing thousands of AI-generated pull requests (PRs) across popular open-source repositories. We evaluate the agents' performance along three task-aware dimensions spanning the PR lifecycle: (1) PR acceptance rate, (2) review discussion volume, and (3) commit message quality. Our quantitative analysis finds that Codex consistently achieves high PR acceptance rates across most task categories, while Copilot's PRs trigger the highest volume of both human and automated review discussions. In contrast, commit-level quality varies independently of acceptance outcomes: Claude and Cursor produce higher proportions of high-quality commit messages across several task types, while Codex exhibits comparatively lower commit quality despite strong integration outcomes. Our…
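To make the first dimension concrete, here is a minimal sketch of how a per-agent, per-task PR acceptance rate could be computed. It is not the authors' pipeline: the record fields (`agent`, `task`, `merged`), the function name `acceptance_rates`, the agent and task labels, and the toy records are all hypothetical and do not reflect the actual AIDev-pop schema or results.

```python
from collections import defaultdict


def acceptance_rates(prs):
    """Return {(agent, task): merged_count / total_count} for PR records.

    Each record is assumed (hypothetically) to be a dict with the keys
    "agent", "task", and a boolean "merged".
    """
    totals = defaultdict(int)
    merged = defaultdict(int)
    for pr in prs:
        key = (pr["agent"], pr["task"])
        totals[key] += 1
        if pr["merged"]:
            merged[key] += 1
    return {key: merged[key] / totals[key] for key in totals}


# Made-up toy records for illustration only; not data from the paper.
toy_prs = [
    {"agent": "AgentA", "task": "fix", "merged": True},
    {"agent": "AgentA", "task": "fix", "merged": False},
    {"agent": "AgentA", "task": "docs", "merged": True},
    {"agent": "AgentB", "task": "fix", "merged": True},
]

rates = acceptance_rates(toy_prs)
for (agent, task), rate in sorted(rates.items()):
    print(f"{agent:8s} {task:5s} {rate:.2f}")
```

Grouping by the (agent, task) pair rather than by agent alone is what makes the comparison task-aware: an agent that mostly submits one kind of PR is then compared with other agents on that same kind of PR.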
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Software Engineering Research · AI in Service Interactions
