Adaptive Steering and Remasking for Safe Generation in Diffusion Language Models

Yejin Lee; Yo-Sub Han

arXiv:2605.13043 · cs.CL · May 14, 2026

TL;DR

This paper introduces a step-wise intervention framework using contrastive safety directions to improve safety in diffusion language models without sacrificing output quality.

Contribution

The paper proposes an inference-time defense that remasks harmful tokens and adaptively steers the denoising process, improving safety without additional fine-tuning.

Findings

01 Reduces jailbreak success rates to 0.64%.

02 Maintains generation quality close to that of the original models.

03 Demonstrates that step-wise safety intervention during denoising is effective.

Abstract

Diffusion Language Models (DLMs) provide a promising alternative to autoregressive language models by generating text through iterative denoising and bidirectional refinement. However, this iterative generation paradigm also introduces unique safety vulnerabilities: harmful tokens generated at intermediate denoising steps can propagate through subsequent refinement steps and eventually induce unsafe outputs. The few existing attempts to remedy this issue either fail to produce safe outputs or produce safe but low-quality outputs. This motivates us to propose an inference-time defense framework based on step-wise intervention during the denoising process, which improves safety without compromising output quality. The key component of our framework is a contrastive safety direction (CSD), a latent direction that captures the semantic boundary between…
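The abstract is truncated here, so the paper's exact formulation is not available on this page. As a rough illustration only, the sketch below shows one common way a contrastive latent direction can be built and used for steering and remasking at a single denoising step. Everything in it is an assumption made for illustration, not the authors' method: the difference-of-means construction, the function names, the `MASK_ID` value, and the `alpha` and `threshold` parameters are all hypothetical.

```python
import numpy as np

MASK_ID = -1  # hypothetical id standing in for the model's [MASK] token


def contrastive_direction(safe_h, harmful_h):
    """Unit vector pointing from the mean harmful state to the mean safe state.

    Assumption: both inputs are (num_examples, hidden_dim) arrays of hidden
    states collected from safe and harmful prompts respectively.
    """
    d = safe_h.mean(axis=0) - harmful_h.mean(axis=0)
    return d / np.linalg.norm(d)


def steer(hidden, direction, alpha):
    """Shift every token's hidden state by alpha along the safety direction."""
    return hidden + alpha * direction


def remask(tokens, hidden, direction, threshold):
    """Replace tokens whose projection on the safety direction is below threshold.

    Returns a new token array (the input is not modified) and the per-token
    projection scores.
    """
    scores = hidden @ direction
    out = tokens.copy()
    out[scores < threshold] = MASK_ID
    return out, scores


if __name__ == "__main__":
    # Toy 2-D hidden states; real models use thousands of dimensions.
    safe_h = np.array([[1.0, 0.0], [3.0, 0.0]])
    harmful_h = np.array([[-1.0, 0.0], [-3.0, 0.0]])
    direction = contrastive_direction(safe_h, harmful_h)

    tokens = np.array([10, 11, 12])
    hidden = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 1.0]])

    remasked, scores = remask(tokens, hidden, direction, threshold=0.0)
    print(remasked, scores)
    print(steer(hidden, direction, alpha=1.5))
```

In a real decoding loop, a step like this would run once per denoising iteration, with remasked positions regenerated in later steps. How the steering strength is adapted across steps is not described in the visible part of the abstract, so the sketch uses a fixed `alpha`.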

Code & Models

Repositories

leeyejin1231/DLM_Steering_Remasking (GitHub)
