A Task-Level Evaluation of AI Agents in Open-Source Projects

Shojibur Rahman; Md Fazle Rabbi; Minhaz Zibran

arXiv:2602.02345·cs.SE·February 3, 2026

TL;DR

This study compares five AI coding agents in open-source projects along three dimensions (PR acceptance, review discussion volume, and commit message quality) to inform how such agents are integrated into software development.

Contribution

It provides a task-aware evaluation of five autonomous coding agents using AIDev-pop, a public dataset of AI-generated pull requests, highlighting each agent's strengths and weaknesses on real-world coding tasks.

Findings

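01 Codex consistently achieves high PR acceptance rates across most task categories.

02 Copilot's PRs trigger the highest volume of both human and automated review discussions.

03 Claude and Cursor produce higher proportions of high-quality commit messages across several task types.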

Abstract

In this paper, we present a comparative study of five autonomous coding agents using AIDev-pop, a public dataset containing thousands of AI-generated pull requests (PRs) across popular open-source repositories. We evaluate the agents' performance along three task-aware dimensions spanning the PR lifecycle: (1) PR acceptance rate, (2) review discussion volume, and (3) commit message quality. Our quantitative analysis finds that Codex consistently achieves high PR acceptance rates across most task categories, while Copilot's PRs trigger the highest volume of both human and automated review discussions. In contrast, commit-level quality varies independently of acceptance outcomes: Claude and Cursor produce higher proportions of high-quality commit messages across several task types, while Codex exhibits comparatively lower commit quality despite strong integration outcomes. Our…
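
As an illustration of the first dimension, a task-aware PR acceptance rate can be read as the share of merged PRs within each (agent, task type) group. The sketch below is a minimal version under that assumption. It is not the authors' pipeline: the record layout, the task labels `fix` and `feat`, the function name `acceptance_rates`, and all values are hypothetical and are not taken from AIDev-pop.

```python
from collections import defaultdict

# Hypothetical toy records: (agent, task_type, merged).
# These values are invented for illustration and are NOT from the paper or dataset.
prs = [
    ("Codex", "fix", True),
    ("Codex", "fix", True),
    ("Codex", "feat", False),
    ("Copilot", "fix", True),
    ("Copilot", "fix", False),
    ("Copilot", "feat", False),
]


def acceptance_rates(records):
    """Return {(agent, task_type): merged_count / total_count}."""
    merged = defaultdict(int)
    total = defaultdict(int)
    for agent, task, was_merged in records:
        total[(agent, task)] += 1
        if was_merged:
            merged[(agent, task)] += 1
    # Every key in `total` has at least one record, so no division by zero.
    return {key: merged[key] / total[key] for key in total}


rates = acceptance_rates(prs)
for (agent, task), rate in sorted(rates.items()):
    print(f"{agent:8s} {task:5s} {rate:.2f}")
```

Grouping by task type before dividing is what makes the comparison task-aware: an agent that mostly submits easy-to-merge task types would otherwise look stronger than one with a harder task mix.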

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Software Engineering Research · AI in Service Interactions