Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models

Guangyu Yang; Jinghong Chen; Jingbiao Mei; Weizhe Lin; Bill Byrne

arXiv:2508.16406·cs.CR·November 4, 2025

Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models

Guangyu Yang, Jinghong Chen, Jingbiao Mei, Weizhe Lin, Bill Byrne

PDF

TL;DR

This paper introduces Retrieval-Augmented Defense (RAD), a framework that enhances large language model safety by detecting jailbreak attacks using a database of known attack examples, allowing for adaptive, training-free updates and controllable safety-utility trade-offs.

Contribution

RAD is a novel, training-free framework that incorporates retrieval of attack examples into detection, enabling adaptive and controllable jailbreak prevention for large language models.

Findings

01

RAD significantly reduces attack success rates on StrongREJECT benchmarks.

02

RAD maintains low false rejection rates for benign queries.

03

RAD achieves a robust safety-utility trade-off across various operating points.

Abstract

Large Language Models (LLMs) remain vulnerable to jailbreak attacks, which attempt to elicit harmful responses from LLMs. The evolving nature and diversity of these attacks pose many challenges for defense systems, including (1) adaptation to counter emerging attack strategies without costly retraining, and (2) control of the trade-off between safety and utility. To address these challenges, we propose Retrieval-Augmented Defense (RAD), a novel framework for jailbreak detection that incorporates a database of known attack examples into Retrieval-Augmented Generation, which is used to infer the underlying, malicious user query and jailbreak strategy used to attack the system. RAD enables training-free updates for newly discovered jailbreak strategies and provides a mechanism to balance safety and utility. Experiments on StrongREJECT show that RAD substantially reduces the effectiveness…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.