ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts
Sydney Johns, Heng Jin, Chaoyu Zhang, Y. Thomas Hou, Wenjing Lou

TL;DR
ARMOR 2025 is a new safety benchmark for large language models, designed to evaluate their adherence to military doctrines such as the Law of War and the Rules of Engagement in defense contexts.
Contribution
The paper introduces ARMOR 2025, a novel military-aligned safety benchmark with doctrinal questions and a structured taxonomy for evaluating LLM safety in military scenarios.
Findings
Evaluation reveals significant safety gaps in current LLMs for military use.
The benchmark covers 519 doctrinal prompts across 12 categories.
Systematic testing highlights the need for improved safety alignment in models.
Abstract
Large language models (LLMs) are now being explored for defense applications that require reliable and legally compliant decision support. They also hold significant potential to enhance decision-making, coordination, and operational efficiency in military contexts. These uses demand evaluation methods that reflect the doctrinal standards that guide real military operations. Existing safety benchmarks focus on general social risks and do not test whether models follow the legal and ethical rules that govern real military operations. To address this gap, we introduce ARMOR 2025, a military-aligned safety benchmark grounded in three core military doctrines: the Law of War, the Rules of Engagement, and the Joint Ethics Regulation. We extract doctrinal text from these sources and generate multiple-choice questions that preserve the intended meaning of each rule. The benchmark is organized…
