Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs

Shiyu Xiang; Ansen Zhang; Yanfei Cao; Yang Fan; Ronghao Chen

arXiv:2502.19041·cs.CR·May 29, 2025

Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs

Shiyu Xiang, Ansen Zhang, Yanfei Cao, Yang Fan, Ronghao Chen

PDF

Open Access

TL;DR

This paper introduces EDDF, a novel essence-driven defense framework that enhances LLMs' robustness against jailbreak attacks by focusing on attack core patterns rather than surface features.

Contribution

The paper proposes a plug-and-play, two-stage defense method that extracts and stores attack essences for improved detection of adversarial prompts in LLMs.

Findings

01

EDDF reduces attack success rate by at least 20%.

02

It outperforms existing defenses significantly.

03

The framework effectively captures attack core patterns.

Abstract

Although Aligned Large Language Models (LLMs) are trained to refuse harmful requests, they remain vulnerable to jailbreak attacks. Unfortunately, existing methods often focus on surface-level patterns, overlooking the deeper attack essences. As a result, defenses fail when attack prompts change, even though the underlying "attack essence" remains the same. To address this issue, we introduce EDDF, an \textbf{E}ssence-\textbf{D}riven \textbf{D}efense \textbf{F}ramework Against Jailbreak Attacks in LLMs. EDDF is a plug-and-play input-filtering method and operates in two stages: 1) offline essence database construction, and 2) online adversarial query detection. The key idea behind EDDF is to extract the "attack essence" from a diverse set of known attack instances and store it in an offline vector database. Experimental results demonstrate that EDDF significantly outperforms existing…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Advanced Graph Neural Networks