Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
Hanhan Zhou, Shamik Roy, Rashmi Gangadharaiah

TL;DR
This paper introduces an adaptive intervention schedule for discrete diffusion language models (DLMs) that concentrates steering on the denoising steps where a target attribute is actually forming, improving control over uniform, every-step intervention.
Contribution
The authors propose a novel adaptive scheduler that enhances attribute steering in DLMs by focusing interventions on active formation stages, outperforming uniform strategies.
Findings
Adaptive scheduling improves steering accuracy across multiple tasks.
The method maintains high generation quality while achieving up to 93% control strength.
The adaptive approach outperforms the uniform-intervention baseline, especially in multi-attribute control.
Abstract
Discrete diffusion language models (DLMs) generate text by iteratively denoising all positions in parallel, offering an alternative to autoregressive models. Controlled generation methods for DLMs, imported from autoregressive models, apply uniform intervention at every denoising step. We show this uniform schedule degrades quality, and the damage compounds when multiple attributes are steered jointly. To diagnose the failure, we train sparse autoencoders on four DLMs (124M-8B parameters) and find that different attributes commit on distinct schedules, varying in timing, sharpness, and magnitude. For instance, topic commits within the first 2% of denoising, whereas sentiment emerges gradually over 20% of the process. Consequently, uniform intervention wastes steering capacity on steps where the target attribute has already solidified or has yet to emerge. We propose a novel adaptive…
