Token Activation Map to Visually Explain Multimodal LLMs
Yi Li, Hualiang Wang, Xinpeng Ding, Haonan Wang, Xiaomeng Li

TL;DR
This paper introduces Token Activation Map (TAM), a method for explaining multimodal large language models (MLLMs). TAM mitigates interference from earlier context tokens and suppresses activation noise, producing high-quality visualizations for model understanding and analysis.
Contribution
The paper proposes TAM, an explanation technique for MLLMs that accounts for interactions between generated tokens and reduces the redundant activations carried over from earlier context, improving interpretability over existing methods.
Findings
The authors report that TAM outperforms state-of-the-art explanation methods in visualization quality.
TAM explains each of the multiple tokens an MLLM generates, whereas CAM-style methods are built for models that produce a single output.
The method applies to a range of visualization scenarios, including object localization and failure analysis.
Abstract
Multimodal large language models (MLLMs) are broadly empowering various fields. Despite their advancements, the explainability of MLLMs remains less explored, hindering deeper understanding, model credibility, and effective visualization. Unlike conventional vision models (e.g., CNNs, ViTs, CLIP) that produce a single output, MLLMs generate sequences of tokens progressively, where each generated token depends on the previous context. Therefore, earlier context tokens can introduce redundant activations that interfere with the explanation of later tokens beyond their original information. Existing studies often overlook this issue, but our observations reveal that these redundant correlations can significantly hurt the reliability of explanations. To address this, we propose an estimated causal inference method to mitigate the interference of context to achieve high-quality MLLM…
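The interference the abstract describes can be made concrete with a small sketch. The abstract is truncated here and does not spell out the estimation procedure, so the code below is not the paper's algorithm; it is a minimal illustration under stated assumptions. It assumes each token has an activation map stored as a flat list of non-negative floats over image patches, estimates interference as the plain mean of the earlier tokens' maps, scales that estimate to the current map by least squares, and subtracts it. The function name `remove_context_interference` and both design choices (mean aggregation, scalar least-squares fit) are hypothetical.

```python
def remove_context_interference(target, context_maps):
    """Return `target` with an estimate of context interference subtracted.

    target:       activation map of the token being explained (list of floats)
    context_maps: activation maps of earlier tokens (list of lists, same length)

    Illustrative assumption, not the paper's procedure: interference is the
    mean context map, scaled by the scalar that best fits it to the target.
    """
    if not context_maps:
        return list(target)

    n = len(target)
    # Aggregate earlier-token maps into a single interference estimate.
    interference = [
        sum(m[i] for m in context_maps) / len(context_maps) for i in range(n)
    ]

    denom = sum(c * c for c in interference)
    if denom == 0.0:
        return list(target)

    # Least-squares scalar s minimizing ||target - s * interference||^2.
    scale = sum(t * c for t, c in zip(target, interference)) / denom
    scale = max(scale, 0.0)  # never add activation back in

    # Subtract the scaled interference and clip negatives to zero.
    return [max(t - scale * c, 0.0) for t, c in zip(target, interference)]


# Toy example: patch 0 fired for an earlier token and leaks into the current map.
current = [0.9, 0.2, 0.1, 0.8]
earlier = [[1.0, 0.0, 0.0, 0.0]]
cleaned = remove_context_interference(current, earlier)
```

In the toy example, the activation at patch 0 is attributed to the earlier token and removed, while the activation at patch 3, which no earlier token shares, is left unchanged.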
