WARBENCH: A Comprehensive Benchmark for Evaluating LLMs in Military Decision-Making

Zongjie Li, Chaozheng Wang, Yuchong Xie, Pingchuan Ma, and Shuai Wang

arXiv:2603.21280·cs.CY·March 24, 2026

TL;DR

WARBENCH introduces a comprehensive benchmarking framework for evaluating large language models in military decision-making, revealing critical vulnerabilities and operational risks in current models under realistic tactical conditions.

Contribution

The paper presents WARBENCH, a new evaluation framework that addresses structural blindspots and tests models under realistic military scenarios, including legal and operational constraints.

Findings

1. Models fail under complex terrain and force asymmetry.

2. Edge-optimized small models have high legal violation rates.

3. Explicit reasoning improves safety and compliance.

Abstract

Large Language Models are increasingly being considered for deployment in safety-critical military applications. However, current benchmarks suffer from structural blindspots that systematically overestimate model capabilities in real-world tactical scenarios. Existing frameworks typically ignore strict legal constraints based on International Humanitarian Law (IHL), omit edge computing limitations, lack robustness testing for fog of war, and inadequately evaluate explicit reasoning. To address these vulnerabilities, we present WARBENCH, a comprehensive evaluation framework establishing a foundational tactical baseline alongside four distinct stress-testing dimensions. Through a large-scale empirical evaluation of nine leading models on 136 high-fidelity historical scenarios, we reveal severe structural flaws. First, baseline tactical reasoning systematically collapses under complex…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)