ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning

Zhengyue Zhao; Yingzi Ma; Somesh Jha; Marco Pavone; Patrick McDaniel; Chaowei Xiao

arXiv:2507.11500 · cs.CR · October 21, 2025

TL;DR

ARMOR introduces a structured reasoning pipeline to improve the safety of large language models by effectively identifying and mitigating malicious jailbreak strategies, achieving state-of-the-art safety performance.

Contribution

The paper proposes ARMOR, a novel three-step reasoning framework that enhances LLM safety by extracting malicious intent and verifying safety, outperforming existing methods against advanced jailbreaks.

Findings

01

ARMOR achieves a harmful output rate of 0.002.

02

ARMOR reduces attack success rate to 0.06.

03

ARMOR generalizes well to unseen jailbreaks.
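
For readers unfamiliar with the metric, attack success rate is conventionally the fraction of attack attempts whose response is judged harmful. The following is a minimal sketch under that assumption; the function name and the verdict list are hypothetical, and this summary does not specify the judge or the exact metric definition used in the paper.

```python
from typing import Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Return the fraction of attack attempts judged harmful (successful)."""
    if not judged_harmful:
        raise ValueError("need at least one judged attempt")
    # True counts as 1 and False as 0, so sum() counts successful attacks.
    return sum(judged_harmful) / len(judged_harmful)


# Hypothetical verdicts for illustration only: 3 of 50 attacks succeed.
verdicts = [True] * 3 + [False] * 47
print(attack_success_rate(verdicts))
```

Lower values indicate a more robust model; a rate of 0 would mean no attack in the evaluated set produced a harmful response.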

Abstract

Large Language Models have shown impressive generative capabilities across diverse tasks, but their safety remains a critical concern. Existing post-training alignment methods, such as SFT and RLHF, reduce harmful outputs yet leave LLMs vulnerable to jailbreak attacks, especially advanced optimization-based ones. Recent system-2 approaches enhance safety by adding inference-time reasoning, where models assess potential risks before producing responses. However, we find these methods fail against powerful out-of-distribution jailbreaks, such as AutoDAN-Turbo and Adversarial Reasoning, which conceal malicious goals behind seemingly benign prompts. We observe that all jailbreaks ultimately aim to embed a core malicious intent, suggesting that extracting this intent is key to defense. To this end, we propose ARMOR, which introduces a structured three-step reasoning pipeline: (1) analyze…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Access Control and Trust