ProxyWar: Dynamic Assessment of LLM Code Generation in Game Arenas

Wenjun Peng; Xinyu Wang; Qi Wu

arXiv:2602.04296·cs.SE·February 5, 2026

TL;DR

ProxyWar introduces a dynamic, game-based framework for evaluating LLM-generated code by embedding agents in competitive environments, revealing limitations of static benchmarks and guiding future improvements.

Contribution

The paper presents a holistic evaluation method that combines automated testing, iterative code repair, and multi-agent tournaments to assess LLM-generated code in dynamic settings.

Findings

01. Benchmark scores often do not reflect real-world performance.

02. Dynamic evaluation uncovers limitations of current LLM code generators.

03. The framework enables research into adaptive problem solving and algorithm discovery.

Abstract

Large language models (LLMs) have revolutionized automated code generation, yet the evaluation of their real-world effectiveness remains limited by static benchmarks and simplistic metrics. We present ProxyWar, a novel framework that systematically assesses code generation quality by embedding LLM-generated agents within diverse, competitive game environments. Unlike existing approaches, ProxyWar evaluates not only functional correctness but also the operational characteristics of generated programs, combining automated testing, iterative code repair, and multi-agent tournaments to provide a holistic view of program behavior. Applied to a range of state-of-the-art coders and games, our approach uncovers notable discrepancies between benchmark scores and actual performance in dynamic settings, revealing overlooked limitations and opportunities for improvement. These findings highlight…

Taxonomy

Topics: Artificial Intelligence in Games · Software Engineering Research · Topic Modeling