WARBENCH: A Comprehensive Benchmark for Evaluating LLMs in Military Decision-Making
Zongjie Li, Chaozheng Wang, Yuchong Xie, Pingchuan Ma, and Shuai Wang

TL;DR
WARBENCH introduces a comprehensive benchmarking framework for evaluating large language models in military decision-making, revealing critical vulnerabilities and operational risks in current models under realistic tactical conditions.
Contribution
The paper presents WARBENCH, a new evaluation framework that addresses structural blindspots and tests models under realistic military scenarios, including legal and operational constraints.
Findings
Models fail under complex terrain and force asymmetry.
Edge-optimized small models have high legal violation rates.
Explicit reasoning improves safety and compliance.
Abstract
Large Language Models are increasingly being considered for deployment in safety-critical military applications. However, current benchmarks suffer from structural blindspots that systematically overestimate model capabilities in real-world tactical scenarios. Existing frameworks typically ignore strict legal constraints based on International Humanitarian Law (IHL), omit edge computing limitations, lack robustness testing for fog of war, and inadequately evaluate explicit reasoning. To address these vulnerabilities, we present WARBENCH, a comprehensive evaluation framework establishing a foundational tactical baseline alongside four distinct stress-testing dimensions. Through a large-scale empirical evaluation of nine leading models on 136 high-fidelity historical scenarios, we reveal severe structural flaws. First, baseline tactical reasoning systematically collapses under complex…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
