Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

Zheng Lin; Zhenxing Niu; Haoxuan Ji; Haichang Gao

arXiv:2605.10582·cs.CR·May 12, 2026

TL;DR

This paper introduces Disrupt-and-Rectify Smoothing, a novel defense method for large language models that enhances jailbreak resistance by combining prompt disruption and rectification within a smoothing framework.

Contribution

The paper presents a two-stage prompt processing scheme (disrupt, then rectify) that improves defense effectiveness and balances harmlessness and helpfulness against jailbreaking attacks.

Findings

1. Outperforms existing defenses in experiments

2. Effective against token-level and prompt-level attacks

3. Provides theoretical bounds for defense success

Abstract

This paper proposes a guaranteed defense method for large language models (LLMs) to safeguard against jailbreaking attacks. Drawing inspiration from the denoised-smoothing approach in the adversarial defense domain, we propose a novel smoothing-based defense method, termed Disrupt-and-Rectify Smoothing (DR-Smoothing). Specifically, we integrate a two-stage prompt processing scheme (first disrupting the input prompt, then rectifying it) into the conventional smoothing defense framework. This disrupt-and-rectify approach improves upon previous disrupt-only approaches by restoring out-of-distribution disrupted prompts to an in-distribution form, thereby reducing the risk of unpredictable LLM behavior. In addition, this two-stage scheme offers a distinct advantage in striking a balance between harmlessness and helpfulness in jailbreaking defense. Notably, we present a theoretical analysis for…
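
The abstract does not specify the disruption operator, the rectifier, or the aggregation rule, so the following Python sketch only illustrates the general shape of a disrupt-then-rectify smoothing loop under stated assumptions. The random character-swap disruption and the majority vote are borrowed from common smoothing-style defenses and are not taken from the paper; `rectify`, `respond`, and `is_refusal` are hypothetical callables standing in for a prompt-restoration model, the protected LLM, and a refusal detector.

```python
import random
import string
from typing import Callable


def disrupt(prompt: str, rate: float, rng: random.Random) -> str:
    """Replace a fraction `rate` of characters with random letters.

    Assumption: a character-swap perturbation; the paper's actual
    disruption operator may differ.
    """
    chars = list(prompt)
    n_swap = int(len(chars) * rate)
    for i in rng.sample(range(len(chars)), n_swap):
        chars[i] = rng.choice(string.ascii_letters)
    return "".join(chars)


def dr_smooth(
    prompt: str,
    rectify: Callable[[str], str],      # hypothetical: restores a disrupted prompt
    respond: Callable[[str], str],      # hypothetical: the protected LLM
    is_refusal: Callable[[str], bool],  # hypothetical: refusal detector
    n_copies: int = 5,
    rate: float = 0.1,
    seed: int = 0,
) -> str:
    """Disrupt, rectify, query, then return a response that agrees with the majority."""
    if n_copies < 1:
        raise ValueError("n_copies must be at least 1")
    rng = random.Random(seed)
    responses = []
    for _ in range(n_copies):
        noisy = disrupt(prompt, rate, rng)   # stage 1: disrupt
        restored = rectify(noisy)            # stage 2: rectify
        responses.append(respond(restored))
    votes = [is_refusal(r) for r in responses]
    majority_refuses = 2 * sum(votes) > len(votes)
    # Return the first response whose vote matches the majority decision.
    for response, vote in zip(responses, votes):
        if vote == majority_refuses:
            return response
    return responses[0]  # not reached when n_copies >= 1
```

In this sketch the rectifier is the only component that distinguishes the loop from a disrupt-only smoothing defense: passing the identity function as `rectify` reduces it to the disrupt-only case.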
