ShieldLearner: A New Paradigm for Jailbreak Attack Defense in LLMs
Ziyi Ni, Hao Wang, Huacan Wang

TL;DR
ShieldLearner introduces an adaptive, interpretable defense paradigm for LLM jailbreak attacks, utilizing human-like learning, attack signature distillation, and continuous self-improvement to outperform existing methods.
Contribution
It presents ShieldLearner, a novel defense framework that mimics human learning, distills attack signatures, and employs adaptive augmentation for improved jailbreak attack resistance.
Findings
Achieves higher defense success rate than baselines on standard and hard test sets.
Operates with lower computational overhead, enhancing practicality.
Effective against concealed malicious prompts in adversarial datasets.
Abstract
Large Language Models (LLMs) have achieved remarkable success in various domains but remain vulnerable to adversarial jailbreak attacks. Existing prompt-defense strategies, including parameter-modifying and parameter-free approaches, face limitations in adaptability, interpretability, and customization, constraining their effectiveness against evolving threats. To address these challenges, we propose ShieldLearner, a novel paradigm that mimics human learning in defense. Through trial and error, it autonomously distills attack signatures into a Pattern Atlas and synthesizes defense heuristics into a Meta-analysis Framework, enabling systematic and interpretable threat detection. Furthermore, we introduce Adaptive Adversarial Augmentation to generate adversarial variations of successfully defended prompts, enabling continuous self-improvement without model retraining. In addition to…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsInformation and Cyber Security
MethodsSparse Evolutionary Training
