DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders

Xu Wang; Bingqing Jiang; Yu Wan; Baosong Yang; Lingpeng Kong; Difan Zou

arXiv:2602.05859·cs.LG·February 6, 2026

TL;DR

This paper introduces DLM-Scope, an SAE-based interpretability framework for diffusion language models. It shows that inserting SAEs into early layers can reduce cross-entropy loss, unlike in autoregressive LLMs, and that SAE features support more effective diffusion-time interventions.

Contribution

The paper is the first to adapt sparse autoencoders for interpretability in diffusion language models. It demonstrates that they extract interpretable features and shows that SAE insertion behaves differently than in autoregressive models.

Findings

01 SAEs can faithfully extract interpretable features in DLMs.

02 SAE insertion reduces cross-entropy loss in early DLM layers, unlike in autoregressive LLMs.

03 SAE features enable more effective diffusion-time interventions.
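The insertion effect in the second finding is a difference between two cross-entropy losses: one from the unmodified model and one with a layer's activations replaced by the SAE reconstruction. The sketch below shows that arithmetic on toy logits. The function names `cross_entropy` and `delta_ce` and the example arrays are hypothetical, and this summary does not state which token positions the paper scores, so this is an illustration of the metric rather than the authors' evaluation code.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of integer targets under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def delta_ce(logits_clean, logits_spliced, targets):
    """Loss with the SAE spliced in minus the clean loss.

    A negative value means that inserting the SAE lowered the loss.
    """
    return cross_entropy(logits_spliced, targets) - cross_entropy(logits_clean, targets)

# Toy logits (assumed values): the spliced run is more confident on the targets.
targets = np.array([0, 1])
logits_clean = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
logits_spliced = np.array([[3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
delta = delta_ce(logits_clean, logits_spliced, targets)
```

In this toy case `delta` is negative, which is the sign the finding reports for early DLM layers. A positive value would be the usual loss penalty described for autoregressive LLMs.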

Abstract

Sparse autoencoders (SAEs) have become a standard tool for mechanistic interpretability in autoregressive large language models (LLMs), enabling researchers to extract sparse, human-interpretable features and intervene on model behavior. As diffusion language models (DLMs) become an increasingly promising alternative to autoregressive LLMs, it is essential to develop mechanistic interpretability tools tailored to this emerging class of models. In this work, we present DLM-Scope, the first SAE-based interpretability framework for DLMs, and demonstrate that trained Top-K SAEs can faithfully extract interpretable features. Notably, we find that inserting SAEs affects DLMs differently than autoregressive LLMs: while SAE insertion in LLMs typically incurs a loss penalty, in DLMs it can reduce cross-entropy loss when applied to early layers, a phenomenon absent or markedly…
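For readers unfamiliar with the Top-K SAEs mentioned in the abstract, here is a minimal NumPy sketch of a generic Top-K sparse autoencoder forward pass: encode an activation vector, keep only the k largest latent pre-activations, and decode. The dimensions, the random weights, the decoder-bias subtraction, and the name `topk_sae_forward` are all assumptions for illustration. The abstract does not specify the paper's architecture or training setup.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, keep the k largest pre-activations per row, then decode."""
    pre = (x - b_dec) @ W_enc + b_enc               # (batch, n_latents)
    # Column indices of the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    latents = np.zeros_like(pre)
    rows = np.arange(pre.shape[0])[:, None]
    # ReLU on the kept values so active features are non-negative.
    latents[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    recon = latents @ W_dec + b_dec                 # (batch, d_model)
    return latents, recon

# Toy setup with assumed sizes and untrained random weights.
rng = np.random.default_rng(0)
d_model, n_latents, k = 16, 64, 4
W_enc = rng.normal(size=(d_model, n_latents)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()
b_enc = np.zeros(n_latents)
b_dec = np.zeros(d_model)

x = rng.normal(size=(8, d_model))                   # stand-in for model activations
latents, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
```

Each row of `latents` has at most `k` nonzero entries, which is what makes individual features inspectable. In an insertion experiment, `recon` would replace the original activation at the chosen layer.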

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Computational and Text Analysis Methods · Domain Adaptation and Few-Shot Learning