LLMs May Not Be Human-Level Players, But They Can Be Testers: Measuring Game Difficulty with LLM Agents
Chang Xiao, Brenda Z. Yang

TL;DR
This paper investigates using Large Language Models (LLMs) as automated testers for measuring game difficulty. It shows that LLM performance correlates with the difficulty human players report, which suggests LLM agents can aid game development.
Contribution
The study proposes a general game-testing framework that uses LLM agents to assess game difficulty and evaluates it on two widely played strategy games, Wordle and Slay the Spire.
Findings
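LLM performance shows a statistically significant and strong correlation with the difficulty indicated by human players, even though LLMs may not play as well as the average human.
Simple, generic prompting techniques are enough to guide LLMs toward this result.
LLMs could serve as effective agents for measuring game difficulty during development.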
Abstract
Recent advances in Large Language Models (LLMs) have demonstrated their potential as autonomous agents across various tasks. One emerging application is the use of LLMs in playing games. In this work, we explore a practical problem for the gaming industry: Can LLMs be used to measure game difficulty? We propose a general game-testing framework using LLM agents and test it on two widely played strategy games: Wordle and Slay the Spire. Our results reveal an interesting finding: although LLMs may not perform as well as the average human player, their performance, when guided by simple, generic prompting techniques, shows a statistically significant and strong correlation with difficulty indicated by human players. This suggests that LLMs could serve as effective agents for measuring game difficulty during the development process. Based on our experiments, we also outline general…
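To make the correlation claim concrete, here is a minimal sketch of how one might compare an LLM agent's per-level performance with a human difficulty signal. The abstract does not say which statistic or performance metric the authors use; Spearman rank correlation, the function names, and the toy numbers below are all assumptions chosen for illustration and are not taken from the paper.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(rank(xs), rank(ys))


# Hypothetical placeholder data, NOT results from the paper:
# average number of guesses per puzzle for an LLM agent and for humans.
llm_avg_guesses = [3.2, 4.1, 4.8, 5.5]
human_avg_guesses = [3.5, 3.9, 4.6, 5.1]
rho = spearman(llm_avg_guesses, human_avg_guesses)
```

A rank-based statistic fits the "worse player, still useful tester" idea: the LLM can need more attempts than humans on every level and still order the levels by difficulty the same way. A significance test (for example a permutation test) would be needed on top of this to support a "statistically significant" claim.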
Taxonomy
Topics: Artificial Intelligence in Law · Law, AI, and Intellectual Property · Law, Economics, and Judicial Systems
