Activation Steering for Masked Diffusion Language Models

Adi Shnaidman; Erin Feiglin; Osher Yaari; Efrat Mentel; Amit Levi; Raz Lapid

arXiv:2512.24143·cs.CL·March 31, 2026

TL;DR

This paper introduces an activation steering method for masked diffusion language models (MDLMs) that controls model behavior at inference time by intervening on residual-stream activations during reverse diffusion, without retraining.

Contribution

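The paper presents an activation steering primitive that extracts a single low-dimensional direction from contrastive prompt sets in one prompt-only forward pass, then applies it as a global intervention on residual-stream activations throughout reverse diffusion, with no optimization and no change to the diffusion sampling procedure.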

Findings

01 Refusal behavior in multiple MDLMs is governed by a consistent, approximately one-dimensional activation subspace.

02 Applying the extracted direction causes significant behavioral shifts.

03 Effective directions can be derived from both pre- and post-instruction tokens.

Abstract

Masked diffusion language models (MDLMs) generate text via iterative masked-token denoising, enabling mask-parallel decoding and distinct controllability and efficiency tradeoffs from autoregressive LLMs. Yet, efficient representation-level mechanisms for inference-time control in MDLMs remain largely unexplored. To address this gap, we introduce an activation steering primitive for MDLMs: we extract a single low-dimensional direction from contrastive prompt sets using one prompt-only forward pass, and apply a global intervention on residual-stream activations throughout reverse diffusion, without performing optimization or altering the diffusion sampling procedure. Using safety refusal as a deployment-relevant case study, we find that refusal behavior in multiple MDLMs is governed by a consistent, approximately one-dimensional activation subspace. Applying the corresponding direction…
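
The procedure the abstract describes can be sketched in a few lines. The sketch below is illustrative only and is not the authors' implementation. It assumes the direction is a normalized difference of mean activations between the two contrastive prompt sets, and that the intervention overwrites the activation's component along that direction; the abstract confirms neither choice. The names steering_direction, steer, and steered_reverse_diffusion are hypothetical, plain Python lists stand in for residual-stream tensors, and step_fn is a placeholder for one denoising step of a real model.

```python
import math

Vector = list[float]


def mean_vector(rows: list[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(pos: list[Vector], neg: list[Vector]) -> Vector:
    """Unit-norm difference of the two class means.

    ASSUMPTION: difference-of-means is a common extraction rule for
    steering directions; the source does not state which rule is used.
    """
    diff = [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("contrastive sets have identical mean activations")
    return [x / norm for x in diff]


def steer(h: Vector, direction: Vector, alpha: float = 0.0) -> Vector:
    """Set the component of h along a unit-norm `direction` to `alpha`.

    alpha = 0.0 removes the component (projection ablation); any other
    value pins the component to that value. ASSUMPTION: illustrative
    intervention, not confirmed by the source.
    """
    coeff = sum(a * b for a, b in zip(h, direction))
    return [a + (alpha - coeff) * d for a, d in zip(h, direction)]


def steered_reverse_diffusion(h, step_fn, direction, num_steps, alpha=0.0):
    """Apply the same intervention after every denoising step.

    `step_fn(h, t)` is a hypothetical stand-in for one reverse-diffusion
    step; the sampling procedure itself is left unchanged.
    """
    for t in reversed(range(num_steps)):
        h = steer(step_fn(h, t), direction, alpha)
    return h
```

With alpha = 0.0 the component along the direction is removed at every step; a nonzero alpha holds it at a fixed value instead. The direction is computed once from prompt activations and reused throughout, so the loop involves no per-sample optimization.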
