ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning
Zhengyue Zhao, Yingzi Ma, Somesh Jha, Marco Pavone, Patrick McDaniel, Chaowei Xiao

TL;DR
ARMOR introduces a structured reasoning pipeline to improve the safety of large language models by effectively identifying and mitigating malicious jailbreak strategies, achieving state-of-the-art safety performance.
Contribution
The paper proposes ARMOR, a novel three-step reasoning framework that enhances LLM safety by extracting malicious intent and verifying safety, outperforming existing methods against advanced jailbreaks.
Findings
ARMOR achieves a harmful output rate of 0.002.
ARMOR reduces attack success rate to 0.06.
ARMOR generalizes well to unseen jailbreaks.
Abstract
Large Language Models have shown impressive generative capabilities across diverse tasks, but their safety remains a critical concern. Existing post-training alignment methods, such as SFT and RLHF, reduce harmful outputs yet leave LLMs vulnerable to jailbreak attacks, especially advanced optimization-based ones. Recent system-2 approaches enhance safety by adding inference-time reasoning, where models assess potential risks before producing responses. However, we find these methods fail against powerful out-of-distribution jailbreaks, such as AutoDAN-Turbo and Adversarial Reasoning, which conceal malicious goals behind seemingly benign prompts. We observe that all jailbreaks ultimately aim to embed a core malicious intent, suggesting that extracting this intent is key to defense. To this end, we propose ARMOR, which introduces a structured three-step reasoning pipeline: (1) analyze…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Access Control and Trust
