Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
Lingyue Fu, Bolun Zhang, Hao Guan, Yaoming Zhu, Lin Qiu, Weiwen Liu, Xuezhi Cao, Xunliang Cai, Weinan Zhang, Yong Yu

TL;DR
This paper introduces PRDBench, a diverse project-level Python benchmark built through agent-driven annotation, together with a fine-tuned evaluation model that aligns closely with human judgments when assessing code agents.
Contribution
It presents a novel agent-driven pipeline for constructing diverse benchmarks and a dedicated, fine-tuned evaluation model that assesses code agents more accurately.
Findings
PRDBench includes 50 real-world Python projects across 20 domains.
The specialized PRDJudge achieves over 90% human alignment.
The framework offers scalable, robust, and accurate evaluation of code agents.
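The summary does not say how human alignment is measured. As a rough illustration only, the sketch below assumes it is the fraction of judgments on which an automated judge and a human annotator give the same pass/fail verdict. The function name `agreement_rate` and the sample verdicts are hypothetical and do not come from the paper.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of items where the judge verdict equals the human verdict.

    Assumption: alignment is plain pass/fail agreement. The paper may
    define its metric differently.
    """
    if len(judge) != len(human):
        raise ValueError("verdict lists must have the same length")
    if not judge:
        raise ValueError("verdict lists must not be empty")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Hypothetical verdicts for five test items (True = pass, False = fail).
judge_verdicts = [True, True, False, True, False]
human_verdicts = [True, False, False, True, False]

rate = agreement_rate(judge_verdicts, human_verdicts)
print(f"agreement: {rate:.2f}")
```

Under this assumed definition, a figure above 90% would mean the judge and the human disagree on fewer than one verdict in ten.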
Abstract
Recent advances in code agents have enabled automated software development at the project level, supported by large language models (LLMs). However, existing benchmarks for code agent evaluation face two major limitations. First, creating high-quality project-level evaluation datasets requires extensive domain expertise, leading to prohibitive annotation costs and limited diversity. Second, while recent Agent-as-a-Judge paradigms address the rigidity of traditional unit tests by enabling flexible metrics, their reliance on In-Context Learning (ICL) with general LLMs often results in inaccurate assessments that misalign with human standards. To address these challenges, we propose an agent-driven benchmark construction pipeline that leverages human supervision to efficiently generate diverse project-level tasks. Based on this, we introduce PRDBench, comprising 50 real-world Python…
