ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development

Pengrui Lu; Shiqi Zhang; Yunzhong Hou; Lyumanshan Ye; Chaoyi Huang; Zixi Chen; Ji Zeng; Hantao Jiang; Pengfei Liu; Yiwei Wang; Ming-Hsuan Yang

arXiv:2602.01655 · cs.AI · February 10, 2026

TL;DR

ProjDevBench is an end-to-end benchmark that gives AI coding agents project requirements and evaluates the resulting repositories on system architecture design, functional correctness, and iterative solution refinement.

Contribution

The paper introduces ProjDevBench, an end-to-end benchmark that combines Online Judge (OJ) testing with LLM-assisted code review to evaluate AI coding agents on complete project tasks.

Findings

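01. The six evaluated coding agents reach an overall acceptance rate of 27.38%.

02. Agents handle basic functionality and data structures but struggle with complex system design and time complexity optimization.

03. The benchmark comprises 20 programming problems across 8 categories, covering both concept-oriented tasks and real-world application scenarios.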

Abstract

Recent coding agents can generate complete codebases from simple prompts, yet existing evaluations focus on issue-level bug fixing and lag behind end-to-end development. We introduce ProjDevBench, an end-to-end benchmark that provides project requirements to coding agents and evaluates the resulting repositories. Combining Online Judge (OJ) testing with LLM-assisted code review, the benchmark evaluates agents on (1) system architecture design, (2) functional correctness, and (3) iterative solution refinement. We curate 20 programming problems across 8 categories, covering both concept-oriented tasks and real-world application scenarios, and evaluate six coding agents built on different LLM backends. Our evaluation reports an overall acceptance rate of 27.38%: agents handle basic functionality and data structures but struggle with complex system design, time complexity optimization, and…
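
As a rough illustration of the headline metric, an acceptance rate can be read as the share of submissions that an Online Judge marks as accepted. The sketch below assumes that reading; the abstract does not spell out how the paper aggregates verdicts across problems and agents, so the function name `acceptance_rate`, the verdict labels, and the sample data are all hypothetical.

```python
from collections import Counter
from typing import Iterable


def acceptance_rate(verdicts: Iterable[str]) -> float:
    """Return the percentage of submissions whose verdict is 'Accepted'.

    Assumption: every submission counts equally. The paper may weight
    problems or agents differently.
    """
    counts = Counter(verdicts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no submissions to score")
    return 100.0 * counts["Accepted"] / total


# Hypothetical verdicts for illustration only, not data from the paper.
sample = [
    "Accepted",
    "Wrong Answer",
    "Time Limit Exceeded",
    "Accepted",
    "Runtime Error",
]

print(f"{acceptance_rate(sample):.2f}%")
```

Keeping the non-accepted verdicts as separate labels, rather than collapsing them into a single failure flag, leaves room to break failures down by cause, such as wrong answers versus time limit violations.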

Taxonomy

Topics: Software Engineering Research · Artificial Intelligence in Games · Scientific Computing and Data Management