ProxyWar: Dynamic Assessment of LLM Code Generation in Game Arenas
Wenjun Peng, Xinyu Wang, Qi Wu

TL;DR
ProxyWar introduces a dynamic, game-based framework for evaluating LLM-generated code by embedding agents in competitive environments, revealing limitations of static benchmarks and guiding future improvements.
Contribution
It presents a novel, holistic evaluation method combining automated testing, iterative repair, and multi-agent tournaments for assessing LLM code in dynamic settings.
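As a rough illustration of how the three stages could fit together, the sketch below chains a test gate, a stubbed repair loop, and a round-robin tournament. It is not the authors' implementation: the `Agent` interface, the rock-paper-scissors game, and every function name here are hypothetical, and the repair step simply walks through pre-written candidate versions where a real system would re-prompt an LLM.

```python
from itertools import combinations
from typing import Callable, Dict, List, Optional

# Hypothetical agent interface: round index -> move (0 = rock, 1 = paper, 2 = scissors).
Agent = Callable[[int], int]


def passes_tests(agent: Agent) -> bool:
    """Automated testing stage: the agent must return a legal move without raising."""
    try:
        return all(agent(r) in (0, 1, 2) for r in range(5))
    except Exception:
        return False


def repair_loop(candidates: List[Agent]) -> Optional[Agent]:
    """Iterative repair stage (stubbed): each candidate stands in for one
    repair attempt; return the first version that passes the tests."""
    for agent in candidates:
        if passes_tests(agent):
            return agent
    return None


def play_match(a: Agent, b: Agent, rounds: int = 9) -> int:
    """Play a fixed number of rounds; return 1 if a wins, -1 if b wins, 0 for a draw."""
    score = 0
    for r in range(rounds):
        move_a, move_b = a(r), b(r)
        if (move_a - move_b) % 3 == 1:
            score += 1
        elif (move_b - move_a) % 3 == 1:
            score -= 1
    return (score > 0) - (score < 0)


def round_robin(agents: Dict[str, Agent]) -> Dict[str, int]:
    """Tournament stage: every pair plays once; count match wins per agent."""
    wins = {name: 0 for name in agents}
    for x, y in combinations(agents, 2):
        result = play_match(agents[x], agents[y])
        if result > 0:
            wins[x] += 1
        elif result < 0:
            wins[y] += 1
    return wins


# Toy entrants (all hypothetical).
def rock(r: int) -> int:
    return 0


def paper(r: int) -> int:
    return 1


def cycler(r: int) -> int:
    return r % 3


def broken(r: int) -> int:
    return 7  # illegal move, so this version fails the test gate


if __name__ == "__main__":
    entrants = {
        "rock": rock,
        "paper": repair_loop([broken, paper]),  # first attempt fails, second passes
        "cycler": cycler,
    }
    print(round_robin(entrants))
```

The design point the sketch tries to capture is that an agent only reaches the tournament after clearing the correctness gate, so rankings reflect how programs behave against opponents, not merely whether they run.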
Findings
Static benchmark scores often diverge from actual performance in dynamic, competitive settings.
Dynamic evaluation uncovers limitations of current LLM code generators.
The framework enables research into adaptive problem solving and algorithm discovery.
Abstract
Large language models (LLMs) have revolutionized automated code generation, yet the evaluation of their real-world effectiveness remains limited by static benchmarks and simplistic metrics. We present ProxyWar, a novel framework that systematically assesses code generation quality by embedding LLM-generated agents within diverse, competitive game environments. Unlike existing approaches, ProxyWar evaluates not only functional correctness but also the operational characteristics of generated programs, combining automated testing, iterative code repair, and multi-agent tournaments to provide a holistic view of program behavior. Applied to a range of state-of-the-art coders and games, our approach uncovers notable discrepancies between benchmark scores and actual performance in dynamic settings, revealing overlooked limitations and opportunities for improvement. These findings highlight…
Taxonomy
Topics: Artificial Intelligence in Games · Software Engineering Research · Topic Modeling
