ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts

Sydney Johns; Heng Jin; Chaoyu Zhang; Y. Thomas Hou; Wenjing Lou

arXiv:2605.00245·cs.AI·May 4, 2026

TL;DR

ARMOR 2025 is a new safety benchmark for large language models, specifically designed to evaluate their adherence to military doctrines like the Law of War and Rules of Engagement in defense contexts.

Contribution

The paper introduces ARMOR 2025, a novel military-aligned safety benchmark with doctrinal questions and a structured taxonomy for evaluating LLM safety in military scenarios.

Findings

01 Evaluation reveals significant safety gaps in current LLMs for military use.

02 The benchmark covers 519 doctrinal prompts across 12 categories.

03 Systematic testing highlights the need for improved safety alignment in models.

Abstract

Large language models (LLMs) are now being explored for defense applications that require reliable and legally compliant decision support. They also hold significant potential to enhance decision-making, coordination, and operational efficiency in military contexts. These uses demand evaluation methods that reflect the doctrinal standards that guide real military operations. Existing safety benchmarks focus on general social risks and do not test whether models follow the legal and ethical rules that govern real military operations. To address this gap, we introduce ARMOR 2025, a military-aligned safety benchmark grounded in three core military doctrines: the Law of War, the Rules of Engagement, and the Joint Ethics Regulation. We extract doctrinal text from these sources and generate multiple-choice questions that preserve the intended meaning of each rule. The benchmark is organized…
