DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models
Walid Bousselham, Angie Boggust, Hendrik Strobelt, Hilde Kuehne

TL;DR
DEX-AR is a novel explainability method for autoregressive vision-language models that generates detailed, dynamic visual explanations at both the token and sequence levels, improving results on perturbation-based and segmentation-based evaluation metrics.
Contribution
It introduces dynamic head filtering and sequence-level filtering to interpret complex autoregressive VLMs, addressing limitations of traditional explainability methods.
Findings
Improves perturbation-based interpretability metrics
Enhances segmentation-based explanation quality
Demonstrates effectiveness on multiple vision-language benchmarks
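The two kinds of metrics named in the findings can be illustrated with a minimal sketch. It is not the paper's evaluation code: the function names, the flat-list patch representation, the 0.5 threshold, and the toy `toy_score` model are all assumptions made for illustration. A deletion test removes patches from most to least relevant and tracks how fast the model score drops (lower area under the curve is better), while IoU compares a thresholded heatmap with a ground-truth mask.

```python
def iou(heatmap, mask, threshold=0.5):
    """IoU between a thresholded heatmap and a binary mask (both flat lists)."""
    pred = [h >= threshold for h in heatmap]
    inter = sum(1 for p, m in zip(pred, mask) if p and m)
    union = sum(1 for p, m in zip(pred, mask) if p or m)
    return inter / union if union else 0.0


def deletion_curve(heatmap, score_fn):
    """Remove patches from most to least relevant, recording the score each time."""
    order = sorted(range(len(heatmap)), key=lambda i: heatmap[i], reverse=True)
    kept = [True] * len(heatmap)
    scores = [score_fn(kept)]
    for i in order:
        kept[i] = False
        scores.append(score_fn(kept))
    return scores


def auc(scores):
    """Trapezoidal area under the curve with the x-axis normalized to [0, 1]."""
    n = len(scores) - 1
    return sum((scores[k] + scores[k + 1]) / 2 for k in range(n)) / n


# Toy stand-in for a model (an assumption): the score is the total
# "true importance" of the patches that are still visible.
weights = [0.6, 0.3, 0.1, 0.0]


def toy_score(kept):
    return sum(w for w, k in zip(weights, kept) if k)


good = [0.9, 0.5, 0.2, 0.1]  # heatmap that ranks patches correctly
bad = [0.1, 0.2, 0.5, 0.9]   # heatmap that ranks patches in reverse

good_auc = auc(deletion_curve(good, toy_score))
bad_auc = auc(deletion_curve(bad, toy_score))
print(good_auc, bad_auc, iou(good, [1, 0, 0, 0]))
```

In this toy setting the correctly ranked heatmap yields the smaller deletion area, which is the behavior a perturbation-based metric rewards. An insertion test is the mirror image: patches are added in relevance order and a higher area is better.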
Abstract
As Vision-Language Models (VLMs) become increasingly sophisticated and widely used, understanding their decision-making process becomes ever more crucial. Traditional explainability methods, designed for classification tasks, struggle with modern autoregressive VLMs due to their complex token-by-token generation process and intricate interactions between visual and textual modalities. We present DEX-AR (Dynamic Explainability for AutoRegressive models), a novel explainability method designed to address these challenges by generating both per-token and sequence-level 2D heatmaps highlighting image regions crucial for the model's textual responses. The proposed method interprets autoregressive VLMs, including the varying importance of layers and generated tokens, by computing layer-wise gradients with respect to attention maps during the token-by-token generation process. DEX-AR…
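The general idea of weighting attention maps by gradients can be sketched as follows. This is a hedged toy illustration, not the DEX-AR algorithm: the abstract does not give the formulas, so the ReLU(gradient × attention) relevance, the image-mass head weighting in `image_mass_weights`, the token weighting in `sequence_heatmap`, and the linear `toy_logit` readout are all assumptions. Gradients are taken by finite differences so the example needs no libraries; a real model would use automatic differentiation.

```python
def grad_wrt_attention(logit_fn, attn, eps=1e-5):
    """Central finite-difference gradient of a scalar logit w.r.t. each attention entry."""
    grad = [[0.0] * len(row) for row in attn]
    for h, row in enumerate(attn):
        for i in range(len(row)):
            orig = row[i]
            row[i] = orig + eps
            up = logit_fn(attn)
            row[i] = orig - eps
            down = logit_fn(attn)
            row[i] = orig  # restore the entry before moving on
            grad[h][i] = (up - down) / (2 * eps)
    return grad


def image_mass_weights(attn):
    """Hypothetical head filter: weight each head by its attention mass on image patches."""
    return [sum(row) for row in attn]


def token_heatmap(attn, grad, head_weights):
    """Per-patch relevance: sum over heads of weight * ReLU(gradient * attention)."""
    n = len(attn[0])
    return [
        sum(w * max(0.0, g[i] * a[i]) for a, g, w in zip(attn, grad, head_weights))
        for i in range(n)
    ]


def sequence_heatmap(token_maps, token_weights):
    """Hypothetical sequence-level map: weighted average of the per-token maps."""
    total = sum(token_weights)
    n = len(token_maps[0])
    return [sum(w * m[i] for m, w in zip(token_maps, token_weights)) / total for i in range(n)]


# Toy layer (assumed values): 2 heads, 3 image patches. Each row holds the last
# generated token's attention to the patches; the remaining mass is on text tokens.
attn = [[0.7, 0.1, 0.1], [0.05, 0.05, 0.0]]
readout = [[2.0, -1.0, 0.5], [1.0, 1.0, 1.0]]


def toy_logit(a):
    # Linear stand-in for the target-token logit as a function of attention.
    return sum(x * u for row, urow in zip(a, readout) for x, u in zip(row, urow))


grad = grad_wrt_attention(toy_logit, attn)
heat = token_heatmap(attn, grad, image_mass_weights(attn))
seq = sequence_heatmap([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [3.0, 1.0])
print(heat, seq)
```

Here the first head puts most of its attention on image patches and therefore dominates the map, and the patch with a negative gradient contributes nothing after the ReLU. Repeating the per-token computation at every generation step and every layer, then aggregating, would give the per-token and sequence-level maps the abstract describes.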
Peer Reviews
Decision: Submitted to ICLR 2026
1. Clear presentation and good writing.
2. Good generalizability: demonstrates consistent performance improvements across multiple VLM architectures, including decoder-only and encoder-decoder models, and outperforms baselines on both perturbation and segmentation tasks.
3. Comprehensive evaluation: provides thorough analysis using diverse metrics such as normalized perplexity, insertion/deletion tests, and segmentation IoU scores.
1. This paper shares similarities with TAM[1], which uses forward logits and causal inference to assess correlations between visual inputs, prompt texts, and the answer sequence. The key distinction is the use of gradient evaluations across layers and heads. The paper should systematically compare the two approaches on algorithmic complexity and on technical differences and advantages, and should test both methods on difficult scenarios (e.g., multi-object scenes, occlusions, ambiguous prompts).
2. Gradie
1) The paper addresses a critical gap in explainability for autoregressive VLMs by providing per-token explanations during sequential generation. This is particularly valuable given the widespread deployment of VLMs, where understanding decision-making processes is crucial for trust and debugging.
2) Dynamic head filtering: the attention head filtering mechanism that identifies heads focused on visual information represents a meaningful contribution to understanding cross-modal attention patterns.
1. Clarity issues: The paper suffers from several theoretical gaps that undermine the rigor of the proposed method. Most critically, the intermediate logits computation in Section 3.2 lacks clear justification for why o^{l,t} should be conditioned only on the last generated token. While this conditioning may stem from the autoregressive structure, the authors fail to explicitly explain how causal masking affects this choice, why this specific conditioning is optimal for attribution, or whether al
* The lack of dedicated explainability methods for autoregressive VLMs, e.g. for question answering, does indeed seem to be an issue. I also think that VLMs should be handled appropriately in terms of explainability, so this is a very important topic; to me, at least, it seems underexplored and hence novel.
* The method's results seem to be better than those of the existing methods it is compared with; moreover, the authors compare with several cutting-edge VLMs.
* The citation convention used in the article is odd and very inconvenient. I have seen submissions with blue citations and some with black ones, but never with missing parentheses. As a result, the citations blend into the flow of the sentence without clear separation. This is very confusing and must be fixed.
* Lines 51-77: the claim that current explainability methods cannot act on autoregressive VLMs is too decisive. There is a plethora of works, some of them are modality-
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling
