Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval
Taiye Chen, Zeming Wei, Ang Li, Yisen Wang

TL;DR
This paper introduces Safety Context Retrieval (SCR), a scalable method using context retrieval to enhance LLM safety by defending against evolving jailbreaking attacks, outperforming existing static defenses.
Contribution
We propose SCR, a novel retrieval-augmented approach that significantly improves robustness of LLMs against diverse and emerging jailbreaking techniques.
Findings
SCR outperforms existing defenses against known jailbreaks.
Even minimal safety examples improve robustness.
SCR effectively counters new jailbreak methods.
Abstract
Large Language Models (LLMs) are known to be vulnerable to jailbreaking attacks, wherein adversaries exploit carefully engineered prompts to induce harmful or unethical responses. Such threats have raised critical concerns about the safety and reliability of LLMs in real-world deployment. While existing defense mechanisms partially mitigate such risks, subsequent advancements in adversarial techniques have enabled novel jailbreaking methods to circumvent these protections, exposing the limitations of static defense frameworks. In this work, we explore defending against evolving jailbreaking threats through the lens of context retrieval. First, we conduct a preliminary study demonstrating that even a minimal set of safety-aligned examples against a particular jailbreak can significantly enhance robustness against this attack pattern. Building on this insight, we further leverage the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsDigital and Cyber Forensics · Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques
MethodsSparse Evolutionary Training
