Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate

Hugh Xuechen Liu; Kıvanç Tatar

arXiv:2605.07342·cs.LG·May 11, 2026

TL;DR

This paper introduces a multi-axis evaluation protocol called 'Mage' for assessing LLM-generated executable game scenes, revealing that compile success does not equate to functional correctness and emphasizing the need for comprehensive evaluation metrics.

Contribution

The paper proposes a novel four-axis evaluation framework for game scene synthesis, demonstrating its effectiveness over traditional compile-pass metrics and providing a new benchmark dataset.

Findings

01

Direct NL-to-C# generation achieves the highest runtime-pass rate (43% mean) but produces structurally vacuous scenes (mechanism F1 ≈ 0.12).

02

Structural IR conditioning halves the runtime-pass rate but recovers domain-faithful structure (F1 up to 1.00).

03

Multi-axis evaluation reveals divergence between compile success and functional correctness.

Abstract

Compile-pass rate is the dominant evaluation signal for LLM code generation, yet for multi-component domain-specific artifacts it can be actively misleading. We demonstrate this on executable game scene synthesis with a four-axis evaluation protocol (named 'Mage') covering compile success, runtime success, structural fidelity, and mechanism adherence, applied to 858 generation attempts across four open-weight LLMs (7B to 30B), 26 hand-crafted Unity goal pattern playable concepts, and two automatically extracted IR granularity levels. Direct NL-to-C# generation achieves the highest runtime-pass rate (43% mean) yet produces structurally vacuous scenes (mechanism $F_{1} \approx 0.12$). Structural IR conditioning halves the runtime rate but recovers domain-faithful structure ($F_{1}$ up to 1.00). Within IR conditioning, behavior-only and full-scene granularity are statistically indistinguishable…
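
To make the divergence between pass rates and mechanism adherence concrete, here is a minimal Python sketch of how per-axis scores might be aggregated over a batch of generation attempts. It is not the paper's implementation. The `Attempt` record, its field names, and the set-based definition of $F_{1}$ (precision and recall over expected versus generated mechanism labels) are all assumptions made for illustration; the abstract does not specify how the score is computed. Structural fidelity could be scored the same way over scene components and is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Attempt:
    """One generation attempt (hypothetical record layout, not the paper's schema)."""
    compiled: bool
    ran: bool
    expected_mechanisms: set[str] = field(default_factory=set)
    generated_mechanisms: set[str] = field(default_factory=set)


def set_f1(expected: set[str], generated: set[str]) -> float:
    """Set-based F1 between expected and generated labels (assumed definition)."""
    if not expected and not generated:
        return 1.0  # nothing expected, nothing produced: treat as perfect agreement
    true_positives = len(expected & generated)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(generated)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


def summarize(attempts: list[Attempt]) -> dict[str, float]:
    """Average each axis independently so disagreements between axes stay visible."""
    n = len(attempts)
    if n == 0:
        raise ValueError("need at least one attempt")
    return {
        "compile_rate": sum(a.compiled for a in attempts) / n,
        "runtime_rate": sum(a.ran for a in attempts) / n,
        "mean_mechanism_f1": sum(
            set_f1(a.expected_mechanisms, a.generated_mechanisms) for a in attempts
        ) / n,
    }


# Illustrative data only: a scene that runs but implements no expected mechanism,
# and a scene that fails at runtime but matches the expected mechanisms exactly.
demo = [
    Attempt(True, True, {"collect", "timer"}, set()),
    Attempt(True, False, {"collect", "timer"}, {"collect", "timer"}),
]
summary = summarize(demo)
```

In this toy batch every attempt compiles, yet the attempt that runs scores zero on mechanism adherence, which is the kind of disagreement a single compile-pass number would hide.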
