GLIMPSE: Holistic Cross-Modal Explainability for Large Vision-Language Models
Guanxi Shen

TL;DR
GLIMPSE is a novel, lightweight framework that enhances interpretability of large vision-language models by providing holistic, faithful visual and textual attribution maps, improving understanding of model reasoning and biases.
Contribution
Introduces GLIMPSE, a lightweight, model-agnostic framework that fuses gradient-weighted attention, adaptive layer propagation, and relevance-weighted token aggregation into response-level heat maps for cross-modal explanation.
Findings
Outperforms prior methods in faithfulness and human-attention alignment.
Enables detailed analysis of model reasoning and biases.
Facilitates diagnosis of hallucination and systematic errors.
Abstract
Recent large vision-language models (LVLMs) have advanced capabilities in visual question answering (VQA). However, interpreting where LVLMs direct their visual attention remains a significant challenge, yet is essential for understanding model behavior. We introduce GLIMPSE (Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation), a lightweight, model-agnostic framework that jointly attributes LVLM outputs to the most relevant visual evidence and textual signals that support open-ended generation. GLIMPSE fuses gradient-weighted attention, adaptive layer propagation, and relevance-weighted token aggregation to produce holistic response-level heat maps for interpreting cross-modal reasoning, outperforming prior methods in faithfulness and pushing the state-of-the-art in human-attention alignment. We demonstrate an analytic approach to uncover fine-grained insights…
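To make the three fused components named in the abstract more concrete, here is a minimal NumPy sketch of how such a pipeline could be wired together. It is not the paper's formulation: the function name `fuse_saliency`, the tensor layout, the ReLU on the gradient-attention product, the softmax layer weighting, and the normalized token weights are all illustrative assumptions, since the abstract does not specify any of these details.

```python
import numpy as np

def fuse_saliency(attn, grad, token_relevance):
    """Hypothetical fusion of attention, gradients, and token relevance.

    attn, grad: arrays of shape (layers, heads, tokens, patches), holding
        attention from each generated token to each image patch and the
        gradient of the output score with respect to that attention.
    token_relevance: non-negative array of shape (tokens,) with a positive sum.
    Returns a heat map of shape (patches,) that sums to 1 when non-zero.
    """
    attn = np.asarray(attn, dtype=float)
    grad = np.asarray(grad, dtype=float)

    # Gradient-weighted attention (assumed form): keep only positive
    # contributions, then average over heads.
    weighted = np.maximum(grad * attn, 0.0).mean(axis=1)  # (layers, tokens, patches)

    # Layer weighting (assumed form): softmax over each layer's total mass.
    layer_score = weighted.sum(axis=(1, 2))               # (layers,)
    layer_w = np.exp(layer_score - layer_score.max())
    layer_w /= layer_w.sum()
    per_token = np.tensordot(layer_w, weighted, axes=(0, 0))  # (tokens, patches)

    # Token aggregation (assumed form): relevance weights normalized to sum to 1.
    tok_w = np.asarray(token_relevance, dtype=float)
    tok_w = tok_w / tok_w.sum()
    heat = tok_w @ per_token                              # (patches,)

    total = heat.sum()
    return heat / total if total > 0 else heat

# Usage with random stand-in tensors (2 layers, 3 heads, 4 tokens, 5 patches).
rng = np.random.default_rng(0)
attn = rng.random((2, 3, 4, 5))
grad = rng.standard_normal((2, 3, 4, 5))
heat = fuse_saliency(attn, grad, token_relevance=[1.0, 2.0, 3.0, 4.0])
```

In a real setting the attention and gradient tensors would come from a forward and backward pass through the LVLM, and the resulting patch-level vector would be reshaped to the image grid for display as a heat map.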
