Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning

Rui Pu; Chaozhuo Li; Rui Ha; Litian Zhang; Lirong Qiu; Xi Zhang

arXiv:2508.03054 · cs.AI · August 6, 2025

TL;DR

This paper introduces a cognitive-driven framework using meta-operations reasoning to improve the detection of unseen jailbreak prompts in large language models, surpassing traditional pattern-matching defenses.

Contribution

The paper proposes a structured reasoning approach, combined with reinforcement learning, to generalize defenses against diverse and unseen jailbreak strategies.

Findings

01. Achieves state-of-the-art defense performance.

02. Demonstrates strong generalization to unseen attacks.

03. Utilizes entropy-guided reinforcement learning for exploration.

Abstract

Defending large language models (LLMs) against jailbreak attacks is essential for their safe and reliable deployment. Existing defenses often rely on shallow pattern matching, which struggles to generalize to novel and unseen attack strategies. To address this challenge, we propose the Cognitive-Driven Defense (CDD) framework, which targets the underlying structure of jailbreak prompts by applying meta-operations, defined as basic manipulations that conceal harmful intent. CDD emulates human cognitive reasoning through a structured reasoning chain. It begins with a global perception of the prompt and follows with a localized analysis to uncover hidden manipulations. By applying supervised fine-tuning on this structured chain, the model learns to identify and reason about known manipulation patterns. To enhance generalization to unseen threats, an entropy-guided reinforcement learning…
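
The abstract is cut off before it says how entropy guides the reinforcement learning stage, so the sketch below does not reproduce the paper's algorithm. It only illustrates the quantity that "entropy-guided" most plausibly refers to, the Shannon entropy of a discrete distribution such as a model's next-token probabilities. The function name `shannon_entropy` and the two example distributions are assumptions made for illustration.

```python
import math


def shannon_entropy(probs):
    """Return the Shannon entropy, in nats, of a discrete distribution.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if any(p < 0.0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, rel_tol=1e-9, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing (the limit of p * log p is 0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Illustrative distributions (assumed values, not data from the paper).
peaked = [0.97, 0.01, 0.01, 0.01]    # model is nearly certain
uniform = [0.25, 0.25, 0.25, 0.25]   # model is maximally uncertain

# A confident distribution has lower entropy than a uniform one.
print(shannon_entropy(peaked) < shannon_entropy(uniform))
```

In reinforcement learning generally, higher policy entropy corresponds to more varied sampled outputs, which is one common way to encourage exploration. Whether CDD uses entropy as a reward bonus, a sampling criterion, or something else is not stated in the text available here.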
