TL;DR
This paper introduces GBQA, a benchmark with 30 games and 124 verified bugs, to evaluate large language models' ability to autonomously discover software bugs in game development.
Contribution
The paper presents GBQA, a benchmark for evaluating LLMs on bug detection in game development. The benchmark is built scalably by a multi-agent system with human experts in the loop, and it ships with a baseline interactive agent and experimental results.
Findings
The best-performing LLM detects 48.39% of the benchmark's bugs
Autonomous bug discovery remains highly challenging
GBQA serves as an effective testbed for progress in software engineering
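To make the headline number concrete, here is a minimal sketch that converts the reported rate into an absolute bug count. It assumes the 48.39% figure is a plain fraction over all 124 verified bugs, which this summary does not state explicitly; the function name `counts_matching` is illustrative only.

```python
TOTAL_BUGS = 124       # from the summary: 124 verified bugs
REPORTED_RATE = 48.39  # from the summary: best LLM's detection rate, in percent


def counts_matching(rate_percent: float, total: int) -> list:
    """Return every integer bug count whose percentage rounds to rate_percent.

    Assumption: the rate is found / total, reported to two decimal places.
    """
    return [
        found
        for found in range(total + 1)
        if abs(100.0 * found / total - rate_percent) < 0.005
    ]


if __name__ == "__main__":
    print(counts_matching(REPORTED_RATE, TOTAL_BUGS))
```

Under that assumption, exactly one count is consistent with the reported rate: 60 of 124 bugs found, leaving 64 undetected.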
Abstract
The autonomous discovery of bugs remains a significant challenge in modern software development. Compared to code generation, the complexity of dynamic runtime environments makes bug discovery considerably harder for large language models (LLMs). In this paper, we take game development as a representative domain and introduce the Game Benchmark for Quality Assurance (GBQA), a benchmark containing 30 games and 124 human-verified bugs across three difficulty levels, to evaluate whether LLMs can autonomously detect software bugs. The benchmark is constructed using a multi-agent system that develops games and injects bugs in a scalable manner, with human experts in the loop to ensure correctness. Moreover, we provide a baseline interactive agent equipped with a multi-round ReAct loop and a memory mechanism, enabling long-horizon exploration of game environments for bug detection across…
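The abstract describes the baseline agent only at a high level: a multi-round ReAct loop plus a memory mechanism. The sketch below shows one way such a loop could be wired together. It is not the authors' implementation. The toy `CounterGame` environment, the `scripted_policy` that stands in for an LLM call, and the `check_invariant` oracle are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class CounterGame:
    """Hypothetical toy environment: a score counter with one injected bug."""
    score: int = 0

    def step(self, action: str) -> str:
        if action == "add":
            self.score += 1
        elif action == "sub":
            # Injected bug: the score is allowed to drop below zero.
            self.score -= 1
        return f"score={self.score}"


@dataclass
class Memory:
    """Bounded log of (thought, action, observation) triples."""
    capacity: int = 8
    entries: List[Tuple[str, str, str]] = field(default_factory=list)

    def add(self, thought: str, action: str, observation: str) -> None:
        self.entries.append((thought, action, observation))
        self.entries = self.entries[-self.capacity:]  # keep the newest entries


def scripted_policy(memory: Memory) -> Tuple[str, str]:
    """Stand-in for an LLM call: returns a (thought, action) pair."""
    if not memory.entries:
        return "Probe the lower bound first.", "sub"
    return "Check that normal increments still work.", "add"


def check_invariant(observation: str) -> Optional[str]:
    """Hypothetical oracle: flag any observation with a negative score."""
    score = int(observation.split("=")[1])
    if score < 0:
        return f"negative score observed: {observation}"
    return None


def run_agent(
    env: CounterGame,
    policy: Callable[[Memory], Tuple[str, str]],
    max_rounds: int = 5,
) -> Tuple[List[str], Memory]:
    """Reason, act, observe, remember; collect distinct bug reports."""
    memory = Memory()
    reports: List[str] = []
    for _ in range(max_rounds):
        thought, action = policy(memory)          # reason
        observation = env.step(action)            # act and observe
        memory.add(thought, action, observation)  # remember
        issue = check_invariant(observation)
        if issue is not None and issue not in reports:
            reports.append(issue)
    return reports, memory


if __name__ == "__main__":
    found, mem = run_agent(CounterGame(), scripted_policy)
    print(found)
    print(len(mem.entries))
```

In a real agent the policy would prompt an LLM with the memory contents, and the bug judgment would also come from the model rather than from a hard-coded invariant. The bounded memory is one simple design choice for keeping long-horizon exploration within a fixed context budget.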
