CATArena: Evaluating Evolutionary Capabilities of Code Agents via Iterative Tournaments
Lingyue Fu, Xin Ding, Linyue Pan, Yaoming Zhu, Shao Zhang, Lin Qiu, Weiwen Liu, Weinan Zhang, Xuezhi Cao, Xunliang Cai, Jiaxin Ding, Yong Yu

TL;DR
CATArena is a new framework for evaluating the iterative and evolutionary development capabilities of LLM code agents through multi-turn tournaments, highlighting their potential for continuous improvement.
Contribution
We introduce CATArena, a comprehensive evaluation framework that measures the evolutionary potential of code agents via iterative tournaments and dual-metric assessment.
Findings
Evolutionary potential is not strictly linked to initial proficiency.
Current agents struggle to effectively combine peer-learning and self-reflection.
CATArena demonstrates high extensibility and robustness across tasks.
Abstract
Current evaluations of Large Language Model (LLM) code agents predominantly focus on generating functional code in single-turn scenarios, and therefore fail to assess an agent's capability for continuous code optimization and multi-turn iterative development. To bridge this gap, we introduce CATArena, a framework designed to evaluate the evolutionary capabilities of code agents via iterative tournaments. Agents engage in multi-turn tournaments and continuously refine their code through self-reflection and peer-learning based on comprehensive execution feedback. For evaluation, we propose a dual-metric system that decouples static generation proficiency from evolutionary potential. Extensive experiments reveal that an agent's evolutionary potential is not strictly correlated with its initial proficiency. Our analysis further reveals that current agents struggle to concurrently leverage both…
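To make the tournament loop and the two-sided scoring idea concrete, here is a minimal Python sketch. It is a toy illustration under stated assumptions, not CATArena's implementation: the names `round_robin`, `run_tournament`, `dual_metrics`, and `play` are hypothetical, strategies are reduced to plain numbers, and the two quantities returned (first-round score and last-minus-first gain) are stand-ins chosen for illustration, not the metric definitions used in the paper.

```python
from itertools import combinations
from statistics import mean


def round_robin(strategies, play):
    """Play every pair once; return each agent's average score for the round."""
    results = {name: [] for name in strategies}
    for a, b in combinations(strategies, 2):
        score_a, score_b = play(strategies[a], strategies[b])
        results[a].append(score_a)
        results[b].append(score_b)
    return {name: mean(scores) for name, scores in results.items()}


def run_tournament(agents, play, rounds):
    """agents maps name -> (initial_strategy, refine).

    refine(strategy, feedback) returns the strategy for the next round, where
    feedback holds every agent's score (own result plus peers' results).
    """
    strategies = {name: init for name, (init, _) in agents.items()}
    history = []
    for _ in range(rounds):
        scores = round_robin(strategies, play)
        history.append(scores)
        strategies = {
            name: agents[name][1](strategies[name], scores) for name in agents
        }
    return history


def dual_metrics(history, name):
    """Illustrative stand-ins: (first-round score, gain from first to last round)."""
    initial = history[0][name]
    gain = history[-1][name] - history[0][name]
    return initial, gain


# Toy game (assumption): a strategy is a number and the larger number wins.
def play(x, y):
    if x == y:
        return 0.5, 0.5
    return (1, 0) if x > y else (0, 1)


agents = {
    "strong_static": (5, lambda s, feedback: s),     # starts strong, never revises
    "weak_learner": (1, lambda s, feedback: s + 3),  # starts weak, improves each round
    "mid_static": (3, lambda s, feedback: s),
}

history = run_tournament(agents, play, rounds=3)
for name in agents:
    print(name, dual_metrics(history, name))
```

In this toy run the agent with the best first-round score ends with a negative gain, while the weakest starter ends with the largest gain, which is one way the two quantities can come apart.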
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques
