BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors
Lingfeng Li, Yunlong Lu, Yuefei Zhang, Jingyu Yao, Yixin Zhu, KeYuan Cheng, Yongyi Wang, Qirui Zheng, Xionghui Yang, Wenxin Li

TL;DR
This paper introduces BotzoneBench, a scalable evaluation framework that measures the strategic reasoning of large language models against fixed hierarchies of skill-calibrated game AI, enabling stable, interpretable, and efficient comparisons over time and across diverse games.
Contribution
It proposes anchoring LLM evaluation to fixed AI skill hierarchies, allowing linear-time absolute skill measurement and stable performance tracking over time.
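The cost argument can be illustrated with a rough sketch. It only counts pairings: an all-play-all LLM-vs-LLM tournament needs one pairing per pair of models, while anchored evaluation needs one pairing per model per anchor. The function names and the ladder size of 4 anchors are assumptions made for illustration, not values from the paper, and games per pairing are not counted.

```python
def round_robin_pairings(num_models: int) -> int:
    """Unordered pairs in an all-play-all LLM-vs-LLM tournament: n(n-1)/2."""
    return num_models * (num_models - 1) // 2


def anchored_pairings(num_models: int, num_anchors: int) -> int:
    """Pairings when each model plays only a fixed ladder of anchor AIs: n*k."""
    return num_models * num_anchors


# Hypothetical ladder size, chosen only for illustration.
NUM_ANCHORS = 4

for n in (5, 10, 20, 40):
    print(
        f"models={n:>2}  "
        f"round_robin={round_robin_pairings(n):>4}  "
        f"anchored={anchored_pairings(n, NUM_ANCHORS):>4}"
    )
```

Under these assumptions, adding one more model to a pool of n costs n new pairings in the round-robin setting but only the fixed number of anchor pairings in the anchored setting, and earlier models' scores do not need to be recomputed because the anchors do not change.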
Findings
Top models reach proficiency comparable to specialized game AI.
Performance differs significantly among the evaluated models.
The framework generalizes to domains with well-defined skill hierarchies.
Abstract
Large Language Models (LLMs) are increasingly deployed in interactive environments requiring strategic decision-making, yet systematic evaluation of these capabilities remains challenging. Existing benchmarks for LLMs primarily assess static reasoning through isolated tasks and fail to capture dynamic strategic abilities. Recent game-based evaluations employ LLM-vs-LLM tournaments that produce relative rankings dependent on transient model pools, incurring quadratic computational costs and lacking stable performance anchors for longitudinal tracking. The central challenge is establishing a scalable evaluation framework that measures LLM strategic reasoning against consistent, interpretable standards rather than volatile peer models. Here we show that anchoring LLM evaluation to fixed hierarchies of skill-calibrated game Artificial Intelligence (AI) enables linear-time absolute skill…
Taxonomy
Topics: Artificial Intelligence in Games · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
