Diffusion-CAM: Faithful Visual Explanations for dMLLMs
Haomin Zuo, Yidi Li, Luoxiao Yang, Xiaofeng Zhang

TL;DR
Diffusion-CAM is an interpretability method designed for diffusion-based multimodal large language models (dMLLMs). It aims to give faithful visual explanations of their non-autoregressive token generation process.
Contribution
The paper introduces what the authors describe as the first interpretability method tailored to dMLLMs. It combines latent features with class-specific gradients to explain the models' parallel generation.
Findings
The authors report that Diffusion-CAM outperforms state-of-the-art methods in localization accuracy.
It improves the visual fidelity of explanations for diffusion multimodal models.
The method effectively resolves spatial ambiguity and reduces confounders in explanations.
Abstract
While diffusion Multimodal Large Language Models (dMLLMs) have recently made remarkable strides in multimodal generation, the development of interpretability mechanisms has lagged behind their architectural evolution. Unlike traditional autoregressive models that produce sequential activations, diffusion-based architectures generate tokens via parallel denoising, resulting in smooth, distributed activation patterns across the entire sequence. Consequently, existing Class Activation Mapping (CAM) methods, which are tailored for local, sequential dependencies, are ill-suited for interpreting these non-autoregressive behaviors. To bridge this gap, we propose Diffusion-CAM, the first interpretability method specifically tailored for dMLLMs. We derive raw activation maps by differentiably probing intermediate representations in the transformer backbone, thereby capturing both latent…
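For readers unfamiliar with CAM-style methods, the idea of combining intermediate features with class-specific gradients can be illustrated with a generic Grad-CAM-style computation over image tokens. This is a minimal sketch of the classic Grad-CAM weighting, not the Diffusion-CAM algorithm itself, whose details are not given in the text above. The function name `grad_cam_tokens` and its parameters are hypothetical, and the sketch assumes the per-token activations and the gradients of a target score with respect to them have already been extracted from some transformer layer.

```python
from typing import List


def grad_cam_tokens(
    activations: List[List[float]],
    gradients: List[List[float]],
    grid_h: int,
    grid_w: int,
) -> List[List[float]]:
    """Generic Grad-CAM-style relevance map over image tokens.

    activations[t][c]: feature of token t in channel c (hypothetical input).
    gradients[t][c]:   d(target score) / d(activations[t][c]) (hypothetical input).
    Returns a grid_h x grid_w map with values scaled to [0, 1].
    """
    n_tokens = len(activations)
    if n_tokens != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    n_channels = len(activations[0])

    # Channel importance: average each channel's gradient over all tokens.
    weights = [
        sum(gradients[t][c] for t in range(n_tokens)) / n_tokens
        for c in range(n_channels)
    ]

    # Per-token relevance: gradient-weighted sum of features, then ReLU
    # so that only positive evidence for the target is kept.
    raw = [
        max(0.0, sum(w * a for w, a in zip(weights, activations[t])))
        for t in range(n_tokens)
    ]

    # Scale to [0, 1]; an all-zero map stays all zero.
    peak = max(raw)
    norm = [v / peak if peak > 0 else 0.0 for v in raw]

    # Reshape the flat token sequence back into the image-patch grid.
    return [norm[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


if __name__ == "__main__":
    acts = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
    grads = [[1.0, -1.0]] * 4
    print(grad_cam_tokens(acts, grads, 2, 2))
```

In a real model the two inputs would typically come from a forward hook on a chosen layer and a backward pass from the target token's score. According to the abstract, this kind of baseline is what the authors argue is ill-suited to parallel denoising, which motivates their tailored method.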
