TL;DR
This paper introduces a step-wise intervention framework using contrastive safety directions to improve safety in diffusion language models without sacrificing output quality.
Contribution
It proposes a novel inference-time defense method that remasks harmful tokens and adaptively steers the denoising process, enhancing safety without additional fine-tuning.
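The remask-and-steer idea can be illustrated with a small sketch. This is not the paper's implementation. It assumes the safety direction is the normalized difference between the mean hidden states of harmful and safe examples, that a token's harmfulness is scored by projecting its hidden state onto that direction, and that steering subtracts the excess projection above a threshold. The names `contrastive_direction`, `harm_score`, `intervene_step`, `MASK`, `threshold`, and `strength` are all hypothetical.

```python
import math

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def contrastive_direction(harmful_states, safe_states):
    """Unit vector from the mean safe state to the mean harmful state.

    Assumption: a difference-of-means direction stands in for the
    paper's contrastive safety direction.
    """
    mu_h = mean_vector(harmful_states)
    mu_s = mean_vector(safe_states)
    diff = [h - s for h, s in zip(mu_h, mu_s)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("harmful and safe means coincide; no direction")
    return [d / norm for d in diff]


def harm_score(state, direction):
    """Projection of a hidden state onto the (unit) safety direction."""
    return sum(x * d for x, d in zip(state, direction))


def intervene_step(tokens, states, direction, threshold=0.5, strength=1.0):
    """Apply one hypothetical intervention at a single denoising step.

    Unmasked tokens whose projection exceeds `threshold` are remasked,
    and their hidden states are shifted back along the direction in
    proportion to how far they exceed the threshold.
    """
    new_tokens, new_states = [], []
    for tok, state in zip(tokens, states):
        score = harm_score(state, direction)
        if tok != MASK and score > threshold:
            shift = strength * (score - threshold)  # adaptive step size
            state = [x - shift * d for x, d in zip(state, direction)]
            tok = MASK  # let later denoising steps regenerate this position
        new_tokens.append(tok)
        new_states.append(list(state))
    return new_tokens, new_states


if __name__ == "__main__":
    direction = contrastive_direction(
        harmful_states=[[1.0, 0.0], [3.0, 0.0]],
        safe_states=[[0.0, 0.0], [0.0, 0.0]],
    )
    tokens, states = intervene_step(
        ["hello", "<harmful>", MASK],
        [[0.1, 0.3], [2.0, 0.5], [5.0, 0.0]],
        direction,
    )
    print(tokens)
```

In a real DLM, a step like this would run inside the denoising loop on the model's actual hidden states; the toy vectors here only show the control flow. Positions that are already masked are left alone, since there is no committed token to retract.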
Findings
Reduces jailbreak success rates to 0.64%.
Maintains generation quality close to that of the original models.
Demonstrates that step-wise intervention during denoising is an effective safety mechanism.
Abstract
Diffusion Language Models (DLMs) provide a promising alternative to autoregressive language models by generating text through iterative denoising and bidirectional refinement. However, this iterative generation paradigm also introduces unique safety vulnerabilities: harmful tokens generated at intermediate denoising steps can propagate through subsequent refinement steps and eventually induce unsafe outputs. Although a few defenses have been proposed, they either fail to generate safe outputs or generate safe yet low-quality outputs. This motivates us to propose an inference-time defense framework based on step-wise intervention during the denoising process, which improves safety without compromising output quality. The key component of our framework is a contrastive safety direction (SGD), a latent direction that captures the semantic boundary between…
