ShieldLearner: A New Paradigm for Jailbreak Attack Defense in LLMs

Ziyi Ni; Hao Wang; Huacan Wang

arXiv:2502.13162·cs.CR·February 20, 2025

ShieldLearner: A New Paradigm for Jailbreak Attack Defense in LLMs

Ziyi Ni, Hao Wang, Huacan Wang

PDF

Open Access

TL;DR

ShieldLearner introduces an adaptive, interpretable defense paradigm for LLM jailbreak attacks, utilizing human-like learning, attack signature distillation, and continuous self-improvement to outperform existing methods.

Contribution

It presents ShieldLearner, a novel defense framework that mimics human learning, distills attack signatures, and employs adaptive augmentation for improved jailbreak attack resistance.

Findings

01

Achieves higher defense success rate than baselines on standard and hard test sets.

02

Operates with lower computational overhead, enhancing practicality.

03

Effective against concealed malicious prompts in adversarial datasets.

Abstract

Large Language Models (LLMs) have achieved remarkable success in various domains but remain vulnerable to adversarial jailbreak attacks. Existing prompt-defense strategies, including parameter-modifying and parameter-free approaches, face limitations in adaptability, interpretability, and customization, constraining their effectiveness against evolving threats. To address these challenges, we propose ShieldLearner, a novel paradigm that mimics human learning in defense. Through trial and error, it autonomously distills attack signatures into a Pattern Atlas and synthesizes defense heuristics into a Meta-analysis Framework, enabling systematic and interpretable threat detection. Furthermore, we introduce Adaptive Adversarial Augmentation to generate adversarial variations of successfully defended prompts, enabling continuous self-improvement without model retraining. In addition to…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsInformation and Cyber Security

MethodsSparse Evolutionary Training