BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors

Lingfeng Li; Yunlong Lu; Yuefei Zhang; Jingyu Yao; Yixin Zhu; KeYuan Cheng; Yongyi Wang; Qirui Zheng; Xionghui Yang; Wenxin Li

arXiv:2602.13214 · cs.AI · February 17, 2026

TL;DR

This paper introduces BotzoneBench, a scalable evaluation framework for large language models that assesses strategic reasoning using fixed skill hierarchies, enabling stable, interpretable, and efficient cross-temporal comparisons across diverse games.

Contribution

The paper proposes anchoring LLM evaluation to fixed hierarchies of skill-calibrated game AI, enabling linear-time absolute skill measurement and stable performance tracking over time.

Findings

01. Top models reach proficiency comparable to specialized game AI.

02. Performance disparities among the evaluated models are significant.

03. The framework generalizes to domains with well-defined skill hierarchies.

Abstract

Large Language Models (LLMs) are increasingly deployed in interactive environments requiring strategic decision-making, yet systematic evaluation of these capabilities remains challenging. Existing benchmarks for LLMs primarily assess static reasoning through isolated tasks and fail to capture dynamic strategic abilities. Recent game-based evaluations employ LLM-vs-LLM tournaments that produce relative rankings dependent on transient model pools, incurring quadratic computational costs and lacking stable performance anchors for longitudinal tracking. The central challenge is establishing a scalable evaluation framework that measures LLM strategic reasoning against consistent, interpretable standards rather than volatile peer models. Here we show that anchoring LLM evaluation to fixed hierarchies of skill-calibrated game Artificial Intelligence (AI) enables linear-time absolute skill…
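The cost contrast in the abstract (quadratic for LLM-vs-LLM tournaments, linear for anchor-based evaluation) can be illustrated with a small counting sketch. This is not the paper's evaluation code. It assumes a full round-robin in which every pair of models meets once, and an anchored setup in which each model meets every anchor once. The function names and the choice of five anchors are hypothetical and used only for illustration.

```python
def round_robin_pairings(num_models: int) -> int:
    """Distinct model-vs-model pairings in a full round-robin (assumed format)."""
    return num_models * (num_models - 1) // 2


def anchored_pairings(num_models: int, num_anchors: int) -> int:
    """Model-vs-anchor pairings when each model faces each fixed anchor once."""
    return num_models * num_anchors


if __name__ == "__main__":
    NUM_ANCHORS = 5  # hypothetical size of the fixed anchor hierarchy
    for n in (10, 20, 40):
        print(n, round_robin_pairings(n), anchored_pairings(n, NUM_ANCHORS))
```

Under these assumptions, doubling the model pool roughly quadruples the round-robin pairings but only doubles the anchored ones. Adding one new model costs a fixed number of anchor matches, and existing scores do not need to be recomputed because the anchors do not change.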

Taxonomy

Topics: Artificial Intelligence in Games · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications