Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval

Taiye Chen; Zeming Wei; Ang Li; Yisen Wang

arXiv:2505.15753·cs.CR·May 22, 2025

Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval

Taiye Chen, Zeming Wei, Ang Li, Yisen Wang

PDF

Open Access

TL;DR

This paper introduces Safety Context Retrieval (SCR), a scalable method using context retrieval to enhance LLM safety by defending against evolving jailbreaking attacks, outperforming existing static defenses.

Contribution

We propose SCR, a novel retrieval-augmented approach that significantly improves robustness of LLMs against diverse and emerging jailbreaking techniques.

Findings

01

SCR outperforms existing defenses against known jailbreaks.

02

Even minimal safety examples improve robustness.

03

SCR effectively counters new jailbreak methods.

Abstract

Large Language Models (LLMs) are known to be vulnerable to jailbreaking attacks, wherein adversaries exploit carefully engineered prompts to induce harmful or unethical responses. Such threats have raised critical concerns about the safety and reliability of LLMs in real-world deployment. While existing defense mechanisms partially mitigate such risks, subsequent advancements in adversarial techniques have enabled novel jailbreaking methods to circumvent these protections, exposing the limitations of static defense frameworks. In this work, we explore defending against evolving jailbreaking threats through the lens of context retrieval. First, we conduct a preliminary study demonstrating that even a minimal set of safety-aligned examples against a particular jailbreak can significantly enhance robustness against this attack pattern. Building on this insight, we further leverage the…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsDigital and Cyber Forensics · Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques

MethodsSparse Evolutionary Training