TL;DR
CodeClash is a new benchmark in which language models compete in multi-round tournaments to develop codebases that achieve high-level goals, revealing their limitations in strategic reasoning and long-term codebase maintenance.
Contribution
Introduces CodeClash, a benchmark for evaluating LMs in goal-oriented, multi-round code development competitions, and provides extensive evaluation results highlighting models' strategic and maintenance limitations.
Findings
Models exhibit diverse development styles.
Models struggle with strategic reasoning and long-term maintenance.
Top models lose every round against expert humans.
Abstract
Current benchmarks for coding evaluate language models (LMs) on concrete, well-specified tasks such as fixing specific bugs or writing targeted tests. However, human programmers do not spend all day incessantly addressing isolated tasks. Instead, real-world software development is grounded in the pursuit of high-level goals, like improving user retention or reducing costs. Evaluating whether LMs can also iteratively develop code to better accomplish open-ended objectives without any explicit guidance remains an open challenge. To address this, we introduce CodeClash, a benchmark where LMs compete in multi-round tournaments to build the best codebase for achieving a competitive objective. Each round proceeds in two phases: agents edit their code, then their codebases compete head-to-head in a code arena that determines winners based on objectives like score maximization, resource…
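The two-phase round structure described in the abstract can be pictured with a minimal Python sketch. This is not the CodeClash implementation: `Player`, `run_tournament`, and the `edit` and `arena` hooks are hypothetical names introduced here for illustration, and passing the history of earlier rounds into the edit phase is an assumption of the sketch rather than something the summary states.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Player:
    """One competitor and the codebase it maintains (hypothetical structure)."""
    name: str
    codebase: Dict[str, str] = field(default_factory=dict)  # path -> file contents
    wins: int = 0


# Hypothetical hooks, not part of any real CodeClash API:
#   edit(player, history)   mutates player.codebase in place
#   arena(a, b)             returns the name of the round winner
EditFn = Callable[[Player, List[str]], None]
ArenaFn = Callable[[Player, Player], str]


def run_tournament(a: Player, b: Player, edit: EditFn, arena: ArenaFn,
                   rounds: int) -> List[str]:
    """Run `rounds` rounds; each round is an edit phase, then a competition phase."""
    history: List[str] = []
    for r in range(1, rounds + 1):
        # Phase 1: each agent edits its own codebase.
        # Assumption: agents may see the results of earlier rounds.
        for player in (a, b):
            edit(player, history)

        # Phase 2: the two codebases compete head-to-head in the arena.
        winner_name = arena(a, b)
        winner = a if winner_name == a.name else b
        winner.wins += 1
        history.append(f"round {r}: {winner_name} won")
    return history
```

In a real setup, the `edit` hook would wrap an LM agent and the `arena` hook would run the game that scores the two codebases; here both are left abstract so that only the control flow is shown.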
