GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

Shufan Jiang; Chios Chen; Zhiyang Chen

arXiv:2604.02648 · cs.SE · April 6, 2026

TL;DR

This paper introduces GBQA, a benchmark with 30 games and 124 verified bugs, to evaluate large language models' ability to autonomously discover software bugs in game development.

Contribution

The paper presents GBQA, a benchmark for evaluating LLMs on bug detection in games, built scalably by a multi-agent system with human experts verifying the bugs. It also provides a baseline interactive agent and extensive experimental results.

Findings

01 Best LLM detects 48.39% of bugs

02 Autonomous bug discovery remains highly challenging

03 GBQA serves as an effective testbed for progress in software engineering
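
As a quick sanity check on the first finding, the headline rate can be converted back into a bug count. This is only a sketch: it assumes the 48.39% is computed over all 124 verified bugs in a single pass, which this page does not state, and the variable names are illustrative.

```python
# Assumption: the reported rate is (bugs found) / (all 124 verified bugs).
TOTAL_BUGS = 124
BEST_RATE = 0.4839  # 48.39%, as reported for the best LLM

# Nearest whole number of bugs consistent with the reported rate.
implied_found = round(TOTAL_BUGS * BEST_RATE)

print(implied_found, f"{implied_found / TOTAL_BUGS:.2%}")  # prints: 60 48.39%
```

Under that assumption, the best model finds about 60 of the 124 bugs and misses roughly 64.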

Abstract

The autonomous discovery of bugs remains a significant challenge in modern software development. Compared to code generation, the complexity of dynamic runtime environments makes bug discovery considerably harder for large language models (LLMs). In this paper, we take game development as a representative domain and introduce the Game Benchmark for Quality Assurance (GBQA), a benchmark containing 30 games and 124 human-verified bugs across three difficulty levels, to evaluate whether LLMs can autonomously detect software bugs. The benchmark is constructed using a multi-agent system that develops games and injects bugs in a scalable manner, with human experts in the loop to ensure correctness. Moreover, we provide a baseline interactive agent equipped with a multi-round ReAct loop and a memory mechanism, enabling long-horizon exploration of game environments for bug detection across…
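
To make the baseline agent's structure concrete, here is a minimal sketch of a multi-round reason-act-observe loop with a memory of past steps. It is not the authors' implementation: `ToyGame`, `Memory`, `policy`, and `looks_buggy` are all hypothetical stand-ins, the "LLM" is replaced by a trivial rule that picks the first untried action, and the bug oracle is hard-coded for this toy game.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGame:
    """Hypothetical game under test; not a GBQA environment."""
    gold: int = 10

    def step(self, action: str) -> str:
        if action == "buy potion":
            self.gold -= 15  # injected bug: no check for insufficient gold
        return f"gold={self.gold}"


@dataclass
class Memory:
    """Stores (thought, action, observation) triples across rounds."""
    entries: list = field(default_factory=list)

    def add(self, thought: str, action: str, observation: str) -> None:
        self.entries.append((thought, action, observation))

    def tried(self, action: str) -> bool:
        return any(a == action for _, a, _ in self.entries)


def policy(memory: Memory, actions: list):
    """Placeholder for the LLM call: choose the first action not yet tried."""
    for action in actions:
        if not memory.tried(action):
            return f"'{action}' is unexplored, try it", action
    return "everything explored, stop", None


def looks_buggy(observation: str) -> bool:
    """Hypothetical oracle: negative gold violates the toy game's rules."""
    return int(observation.split("=")[1]) < 0


def run_agent(game: ToyGame, actions: list, max_rounds: int = 5):
    memory, reports, flagged = Memory(), [], set()
    for _ in range(max_rounds):
        thought, action = policy(memory, actions)  # reason
        if action is None:
            break
        observation = game.step(action)            # act and observe
        memory.add(thought, action, observation)   # remember
        # Report each suspicious observation only once.
        if looks_buggy(observation) and observation not in flagged:
            flagged.add(observation)
            reports.append(f"after '{action}': {observation}")
    return reports, memory


reports, memory = run_agent(ToyGame(), ["look", "buy potion", "rest"])
print(reports)
```

In a real setting, `policy` would prompt a model with the memory contents, and deciding whether an observation is a bug would itself be part of the model's reasoning rather than a fixed rule.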

Code & Models

Repositories

camel-ai/GBQA (GitHub)
