CodeClash: Benchmarking Goal-Oriented Software Engineering

John Yang; Kilian Lieret; Joyce Yang; Carlos E. Jimenez; Muhtasham Oblokulov; Aryan Siddiqui; Ofir Press; Ludwig Schmidt; Diyi Yang

arXiv:2511.00839 · cs.SE · May 14, 2026

TL;DR

CodeClash is a new benchmark where language models compete in multi-round tournaments to develop codebases that achieve high-level goals, revealing their strategic limitations and challenges in long-term maintenance.

Contribution

Introduces CodeClash, a benchmark for evaluating LMs in goal-oriented, multi-round code development competitions, and provides extensive evaluation results highlighting models' strategic and maintenance limitations.

Findings

01. Models exhibit diverse development styles.

02. Models struggle with strategic reasoning and long-term maintenance.

03. Top models lose every round against expert humans.

Abstract

Current benchmarks for coding evaluate language models (LMs) on concrete, well-specified tasks such as fixing specific bugs or writing targeted tests. However, human programmers do not spend all day incessantly addressing isolated tasks. Instead, real-world software development is grounded in the pursuit of high-level goals, like improving user retention or reducing costs. Evaluating whether LMs can also iteratively develop code to better accomplish open-ended objectives without any explicit guidance remains an open challenge. To address this, we introduce CodeClash, a benchmark where LMs compete in multi-round tournaments to build the best codebase for achieving a competitive objective. Each round proceeds in two phases: agents edit their code, then their codebases compete head-to-head in a code arena that determines winners based on objectives like score maximization, resource…
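The two-phase round structure described in the abstract can be sketched as a simple loop. This is a minimal illustration, not the CodeClash implementation: every name here (`Player`, `run_tournament`, `toy_edit`, `toy_arena`) is hypothetical, the "edit" step stands in for an LM agent modifying its repository, and the "arena" uses a deliberately trivial objective in place of a real competitive game.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# NOTE: all names below are hypothetical and chosen for illustration only;
# they are not taken from the CodeClash repository.


@dataclass
class Player:
    name: str
    codebase: Dict[str, str] = field(default_factory=dict)  # filename -> contents
    wins: int = 0


EditFn = Callable[[Player, int], None]   # mutates the player's codebase
ArenaFn = Callable[[List[Player]], str]  # returns the winning player's name


def run_tournament(players: List[Player], edit_fn: EditFn,
                   arena_fn: ArenaFn, rounds: int) -> List[str]:
    """Run a multi-round tournament and return the winner of each round."""
    history: List[str] = []
    for round_idx in range(rounds):
        # Phase 1: every player edits its own codebase.
        for player in players:
            edit_fn(player, round_idx)
        # Phase 2: the codebases compete head-to-head in the arena.
        winner = arena_fn(players)
        for player in players:
            if player.name == winner:
                player.wins += 1
        history.append(winner)
    return history


def toy_edit(player: Player, round_idx: int) -> None:
    # Stand-in for an agent's edit: append lines to bot.py.
    # Player "a" appends two lines per round, everyone else appends one.
    lines = 2 if player.name == "a" else 1
    player.codebase["bot.py"] = player.codebase.get("bot.py", "") + "pass\n" * lines


def toy_arena(players: List[Player]) -> str:
    # Stand-in objective (an assumption): the longest bot.py wins.
    return max(players, key=lambda p: len(p.codebase.get("bot.py", ""))).name


if __name__ == "__main__":
    contestants = [Player("a"), Player("b")]
    print(run_tournament(contestants, toy_edit, toy_arena, rounds=3))
```

Because codebases persist across rounds in this sketch, each edit builds on the previous state, which is the property that makes long-term maintenance part of what the tournament measures.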

Code & Models

Repositories

codeclash-ai/CodeClash (GitHub)
