Diffusion-CAM: Faithful Visual Explanations for dMLLMs

Haomin Zuo; Yidi Li; Luoxiao Yang; Xiaofeng Zhang

arXiv:2604.11005·cs.AI·April 14, 2026


TL;DR

Diffusion-CAM is a novel interpretability method designed specifically for diffusion-based multimodal large language models, providing accurate and faithful visual explanations of their non-autoregressive token generation process.

Contribution

The paper introduces the first interpretability approach tailored to dMLLMs. It captures both latent features and class-specific gradients in order to explain the models' parallel generation.

Findings

1. Diffusion-CAM outperforms state-of-the-art methods in localization accuracy.

2. It enhances visual fidelity of explanations for diffusion multimodal models.

3. The method effectively resolves spatial ambiguity and reduces confounders in explanations.
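The summary does not say which metric backs the localization claim. As an illustration only, the sketch below scores heatmaps with the pointing game, a common choice for CAM-style methods: an explanation counts as a hit when its strongest pixel falls inside the ground-truth box. The function names, the box layout, and the choice of metric are all assumptions, not details taken from the paper.

```python
from typing import List, Sequence, Tuple

# Assumed box layout: (row_min, col_min, row_max, col_max), inclusive.
Box = Tuple[int, int, int, int]


def pointing_game_hit(heatmap: Sequence[Sequence[float]], box: Box) -> bool:
    """Return True if the heatmap's maximum lies inside the box."""
    cells = (
        (r, c)
        for r in range(len(heatmap))
        for c in range(len(heatmap[0]))
    )
    # Location of the single strongest response in the heatmap.
    best_r, best_c = max(cells, key=lambda rc: heatmap[rc[0]][rc[1]])
    r0, c0, r1, c1 = box
    return r0 <= best_r <= r1 and c0 <= best_c <= c1


def pointing_accuracy(
    heatmaps: List[Sequence[Sequence[float]]], boxes: List[Box]
) -> float:
    """Fraction of heatmaps whose peak lands inside its ground-truth box."""
    hits = sum(pointing_game_hit(h, b) for h, b in zip(heatmaps, boxes))
    return hits / len(heatmaps)


# Toy data: the peak (0.9) sits at row 1, column 0.
demo_map = [[0.1, 0.2], [0.9, 0.3]]
demo_accuracy = pointing_accuracy([demo_map, demo_map], [(1, 0, 1, 1), (0, 0, 0, 1)])
```

In the toy call, the first box covers the peak and the second does not, so one of the two heatmaps scores a hit.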

Abstract

While diffusion Multimodal Large Language Models (dMLLMs) have recently made remarkable strides in multimodal generation, the development of interpretability mechanisms has lagged behind their architectural evolution. Unlike traditional autoregressive models, which produce sequential activations, diffusion-based architectures generate tokens via parallel denoising, resulting in smooth, distributed activation patterns across the entire sequence. Consequently, existing Class Activation Mapping (CAM) methods, which are tailored for local, sequential dependencies, are ill-suited for interpreting these non-autoregressive behaviors. To bridge this gap, we propose Diffusion-CAM, the first interpretability method specifically tailored for dMLLMs. We derive raw activation maps by differentiably probing intermediate representations in the transformer backbone, thereby capturing both latent…
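The abstract is cut off before it describes how the raw maps are refined, so the following is only a sketch of the generic Grad-CAM recipe that CAM-style methods build on, not the Diffusion-CAM algorithm itself. It assumes the intermediate features and the gradients of a target score with respect to those features have already been extracted from the backbone. The function names and the channel-by-token layout are hypothetical.

```python
from typing import List


def grad_cam_map(
    activations: List[List[float]], gradients: List[List[float]]
) -> List[float]:
    """Generic Grad-CAM-style relevance over visual tokens.

    activations[k][i]: feature channel k at visual token i (assumed layout).
    gradients[k][i]:   d(target score) / d(activations[k][i]).
    Returns one relevance value per token, scaled to [0, 1].
    """
    num_tokens = len(activations[0])
    # Channel weights: gradient averaged over all tokens.
    weights = [sum(g_row) / num_tokens for g_row in gradients]
    # Weighted sum over channels, then ReLU to keep positive evidence only.
    cam = [
        max(0.0, sum(w * a_row[i] for w, a_row in zip(weights, activations)))
        for i in range(num_tokens)
    ]
    peak = max(cam)
    return [c / peak for c in cam] if peak > 0 else cam


def to_grid(values: List[float], height: int, width: int) -> List[List[float]]:
    """Reshape a flat token sequence into the image patch grid."""
    assert len(values) == height * width
    return [values[r * width:(r + 1) * width] for r in range(height)]


# Toy input: 2 channels, 4 visual tokens arranged as a 2x2 patch grid.
toy_activations = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 3.0]]
toy_gradients = [[0.5, 0.5, 0.5, 0.5], [-1.0, -1.0, -1.0, -1.0]]
toy_heatmap = to_grid(grad_cam_map(toy_activations, toy_gradients), 2, 2)
```

In a real model the two inputs would typically come from a forward hook on a transformer layer plus a backward pass from the chosen output token. For a diffusion model, which denoising step to probe is a further design choice that this sketch leaves open.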
