SC2Arena and StarEvolve: Benchmark and Self-Improvement Framework for LLMs in Complex Decision-Making Tasks
Pengbo Shen, Yaqing Wang, Ni Mu, Yao Luan, Runpeng Xie, Senhao Yang, Lexiang Wang, Hao Hu, Shuang Xu, Yiqin Yang, Bo Xu

TL;DR
This paper introduces SC2Arena, a comprehensive StarCraft II benchmark, and StarEvolve, a hierarchical self-improvement framework for LLMs to enhance strategic decision-making in complex environments.
Contribution
It presents a full-scale StarCraft II benchmark supporting all playable races and actions, and a novel self-improvement framework for LLMs with iterative fine-tuning and strategic planning.
Findings
StarEvolve outperforms existing models in strategic planning tasks.
SC2Arena enables more realistic and comprehensive evaluation of LLMs.
The framework facilitates continuous self-improvement of AI agents.
Abstract
Evaluating large language models (LLMs) in complex decision-making is essential for advancing AI's capacity for strategic planning and real-time adaptation. However, existing benchmarks for tasks like StarCraft II fail to capture the game's full complexity, such as its complete game context, diverse action spaces, and all playable races. To address this gap, we present SC2Arena, a benchmark that fully supports all playable races and low-level action spaces, and that optimizes text-based observations to tackle spatial reasoning challenges. Complementing this, we introduce StarEvolve, a hierarchical framework that integrates strategic planning with tactical execution, featuring iterative self-correction and continuous improvement via fine-tuning on high-quality gameplay data. Its key components include a Planner-Executor-Verifier structure to break down gameplay, and a scoring system for selecting…
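The Planner-Executor-Verifier structure with iterative self-correction mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `Verdict` type, the dictionary-based observation, and the retry limit are all hypothetical, and the toy planner, executor, and verifier stand in for what would be LLM calls in a real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical observation type: a flat dict of game-state counters.
Observation = Dict[str, int]


@dataclass
class Verdict:
    """Result of a verification pass (hypothetical structure)."""
    ok: bool
    feedback: str = ""


def plan_execute_verify(
    obs: Observation,
    planner: Callable[[Observation, Optional[str]], str],
    executor: Callable[[Observation, str], List[str]],
    verifier: Callable[[Observation, List[str]], Verdict],
    max_retries: int = 2,
) -> List[str]:
    """Plan, translate the plan into actions, then verify.

    On rejection, the verifier's feedback is passed back to the planner
    and the cycle repeats, up to max_retries extra attempts.
    """
    feedback: Optional[str] = None
    for _ in range(max_retries + 1):
        plan = planner(obs, feedback)       # high-level strategy
        actions = executor(obs, plan)       # low-level actions
        verdict = verifier(obs, actions)    # feasibility check
        if verdict.ok:
            return actions
        feedback = verdict.feedback
    return []  # no verified actions: issue nothing this step


# Toy stand-ins (assumptions for illustration only).
def toy_planner(obs: Observation, feedback: Optional[str]) -> str:
    if feedback is not None and "minerals" in feedback:
        return "gather minerals"
    return "train worker"


def toy_executor(obs: Observation, plan: str) -> List[str]:
    return ["train_worker"] if plan == "train worker" else ["gather_minerals"]


def toy_verifier(obs: Observation, actions: List[str]) -> Verdict:
    # Toy rule: training a worker needs at least 50 minerals.
    if "train_worker" in actions and obs["minerals"] < 50:
        return Verdict(False, "not enough minerals")
    return Verdict(True)


if __name__ == "__main__":
    print(plan_execute_verify({"minerals": 10}, toy_planner, toy_executor, toy_verifier))
    print(plan_execute_verify({"minerals": 100}, toy_planner, toy_executor, toy_verifier))
```

With 10 minerals, the first plan is rejected and the feedback steers the planner toward gathering instead; with 100 minerals, the first plan passes verification unchanged.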
