Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing
Zheng Lin, Zhenxing Niu, Haoxuan Ji, Haichang Gao

TL;DR
This paper introduces Disrupt-and-Rectify Smoothing, a novel defense method for large language models that enhances jailbreak resistance by combining prompt disruption and rectification within a smoothing framework.
Contribution
The paper presents a two-stage prompt processing scheme (disrupt, then rectify) that improves defense effectiveness against jailbreaking attacks while balancing harmlessness and helpfulness.
Findings
Outperforms existing defenses in experiments
Effective against token-level and prompt-level attacks
Provides theoretical bounds for defense success
Abstract
This paper proposes a guaranteed defense method for large language models (LLMs) to safeguard against jailbreaking attacks. Drawing inspiration from the denoised-smoothing approach in the adversarial defense domain, we propose a novel smoothing-based defense method, termed Disrupt-and-Rectify Smoothing (DR-Smoothing). Specifically, we integrate a two-stage prompt processing scheme (first disrupting the input prompt, then rectifying it) into the conventional smoothing defense framework. This disrupt-and-rectify approach improves upon previous disrupt-only approaches by restoring out-of-distribution disrupted prompts to an in-distribution form, thereby reducing the risk of unpredictable LLM behavior. In addition, this two-stage scheme offers a distinct advantage in striking a balance between harmlessness and helpfulness in jailbreaking defense. Notably, we present a theoretical analysis for…
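The disrupt-then-rectify pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the disruption operator, the rectifier, or the aggregation rule. Random character replacement, the caller-supplied `rectify`, `llm`, and `is_refusal` callables, and the majority vote over perturbed copies are all assumptions made here for illustration, in the style of earlier smoothing defenses.

```python
import random
import string
from collections import Counter
from typing import Callable


def disrupt(prompt: str, rate: float, rng: random.Random) -> str:
    """Replace a fraction `rate` of characters at random (assumed disruption)."""
    chars = list(prompt)
    n_swap = int(len(chars) * rate)
    for i in rng.sample(range(len(chars)), n_swap):
        chars[i] = rng.choice(string.ascii_letters)
    return "".join(chars)


def dr_smoothing(
    prompt: str,
    llm: Callable[[str], str],          # hypothetical: maps a prompt to a response
    rectify: Callable[[str], str],      # hypothetical: restores a disrupted prompt
    is_refusal: Callable[[str], bool],  # hypothetical: flags a refusal response
    n_copies: int = 5,
    rate: float = 0.1,
    seed: int = 0,
) -> str:
    """Disrupt, rectify, and query several copies, then take a majority vote."""
    rng = random.Random(seed)
    responses = []
    for _ in range(n_copies):
        noisy = disrupt(prompt, rate, rng)   # stage 1: disrupt
        restored = rectify(noisy)            # stage 2: rectify
        responses.append(llm(restored))
    votes = [is_refusal(r) for r in responses]
    majority = Counter(votes).most_common(1)[0][0]
    # Return the first response that agrees with the majority vote.
    return next(r for r, v in zip(responses, votes) if v == majority)
```

In this sketch an odd `n_copies` avoids ties in the vote. A disrupt-only defense corresponds to passing an identity function as `rectify`.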
