CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V

John Chen; Sihan Cheng; Can Gurkan; Mingyi Lin

arXiv:2604.07733 · cs.AI · April 10, 2026

TL;DR

CivBench is a benchmark for evaluating the strategic decision-making of LLM-based agents in multiplayer Civilization V. It assesses long-horizon, multi-agent play using turn-level victory-probability estimates rather than final win/loss alone.

Contribution

CivBench introduces an evaluation framework that measures strategic capability from turn-level signals, addressing the sparsity of outcome-only (win/loss) assessment.

Findings

01. CivBench shows potential to estimate strategic capability across diverse models while remaining unsaturated.

02. The effect of an agentic setup varies by model.

03. Models exhibit distinct strategic profiles that win/loss outcomes alone do not reveal.

Abstract

Evaluating strategic decision-making in LLM-based agents requires generative, competitive, and longitudinal environments, yet few benchmarks provide all three, and fewer still offer evaluation signals rich enough for long-horizon, multi-agent play. We introduce CivBench, a benchmark for LLM strategists (i.e., agentic setups) in multiplayer Civilization V. Because terminal win/loss is too sparse a signal in games spanning hundreds of turns and multiple opponents, CivBench trains models on turn-level game state to estimate victory probabilities throughout play, validated through predictive, construct, and convergent validity. Across 307 games with 7 LLMs and multiple CivBench agent conditions, we demonstrate CivBench's potential to estimate strategic capabilities as an unsaturated benchmark, reveal model-specific effects of agentic setup, and outline distinct strategic profiles not…
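
To make the idea of a turn-level victory-probability estimate concrete, the following is a minimal sketch and not the authors' method. It fits a plain logistic regression on synthetic per-turn features and compares its Brier score against a constant 0.5 baseline as a simple check of predictive quality. The feature names (score_share, city_share, tech_share), the data generator, and every function name here are hypothetical, and the data is randomly generated rather than taken from CivBench games.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_turns(n):
    """Generate n fake turn snapshots (hypothetical features, not CivBench data).

    Each row is [score_share, city_share, tech_share], three noisy readings
    of a hidden player strength in [0, 1]. The label is 1 if that player
    eventually "wins" in this toy model, else 0.
    """
    rows, labels = [], []
    for _ in range(n):
        strength = random.random()
        row = [min(1.0, max(0.0, strength + random.gauss(0, 0.1)))
               for _ in range(3)]
        label = 1 if strength + random.gauss(0, 0.1) > 0.5 else 0
        rows.append(row)
        labels.append(label)
    return rows, labels


def fit_logistic(rows, labels, lr=0.5, epochs=100):
    """Fit logistic regression with per-sample gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of log loss with respect to the logit
            bias -= lr * err
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias


def win_probability(weights, bias, x):
    """Estimated probability of eventual victory for one turn snapshot."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


random.seed(0)
train_rows, train_labels = make_synthetic_turns(300)
test_rows, test_labels = make_synthetic_turns(300)

weights, bias = fit_logistic(train_rows, train_labels)
test_probs = [win_probability(weights, bias, x) for x in test_rows]

model_brier = brier_score(test_probs, test_labels)
baseline_brier = brier_score([0.5] * len(test_labels), test_labels)
print("model Brier:", round(model_brier, 3))
print("constant-0.5 baseline Brier:", baseline_brier)
```

A lower Brier score than the constant baseline indicates that the per-turn estimates carry information about the eventual outcome, which is the kind of dense signal that a single terminal win/loss label cannot provide.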
