Evaluating Large Language Models as Live Strategic Agents: Provider Performance, Hybrid Decomposition, and Operational Gaps in Timed Risk Play
H. C. Ekne

TL;DR
This paper evaluates large language models in a timed, multi-phase Risk environment and finds that live-agent performance depends on end-to-end system behavior, planning, and execution, not on raw model capability alone.
Contribution
It introduces a novel evaluation framework for LLMs as live agents in timed strategic settings, highlighting the importance of system-level factors over isolated benchmarks.
Findings
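gemini-3.1-pro-preview won 20 of 32 games in a replicated cross-provider championship against gpt-5.1, claude-opus-4-7, and kimi-k2.6.
Much of the provider spread in that championship came from end-to-end system behavior.
With execution standardized on a cheaper Gemini Flash scaffold, a pooled 32-game planner bakeoff is consistent with near-equality (p ≈ 0.821), so planning quality appears comparable across models.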
Abstract
Static benchmarks capture only part of how large language models behave in practice. Real systems place models inside repeated loops with time limits, formatting constraints, and failure modes. We study this setting in a timed multi-phase Risk environment with explicit victory targets and repeated planning and execution cycles. In a replicated 32-game cross-provider championship under frozen rules, gemini-3.1-pro-preview won 20 of 32 games against gpt-5.1, claude-opus-4-7, and kimi-k2.6, and the pooled winner distribution differs strongly from an equal-strength null (p ≈ 1.5 × 10^-5). We then separate planning from execution by standardizing execution on a cheaper Gemini Flash scaffold. Under this design, a pooled 32-game planner bakeoff is consistent with near-equality (p ≈ 0.821), which indicates that much of the earlier provider spread came from end-to-end system behavior…
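To make the equal-strength null concrete, here is a minimal sketch of one way to ask how surprising 20 wins in 32 games would be if all four models were equally strong. It is an illustration only and not the paper's test: the abstract reports a p-value for the pooled winner distribution across all four models, but gives only one model's win count, so this sketch computes a one-sided exact binomial tail for a single model instead. The function name `binomial_upper_tail` and the assumption that each of four players wins any given game with probability 1/4 are introduced here for illustration.

```python
from math import comb


def binomial_upper_tail(wins: int, games: int, p_win: float) -> float:
    """Return P(X >= wins) for X ~ Binomial(games, p_win)."""
    return sum(
        comb(games, k) * p_win**k * (1 - p_win) ** (games - k)
        for k in range(wins, games + 1)
    )


# Assumption (not from the paper): four equally strong players,
# so each one wins any given game with probability 1/4.
p_tail = binomial_upper_tail(wins=20, games=32, p_win=0.25)
print(f"P(at least 20 wins in 32 games under equal strength) = {p_tail:.3g}")
```

Because this tail considers one model's win count in isolation, its value should not be expected to match the pooled figure quoted in the abstract. Reproducing that figure would require the win counts for all four models.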
