GLIMPSE: Holistic Cross-Modal Explainability for Large Vision-Language Models

Guanxi Shen

arXiv:2506.18985 · cs.CV · July 30, 2025

TL;DR

GLIMPSE is a lightweight, model-agnostic framework for interpreting large vision-language models (LVLMs). It attributes each open-ended response to the visual evidence and textual signals that support it, producing response-level heat maps for studying model reasoning and biases.

Contribution

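The paper introduces GLIMPSE (Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation), a model-agnostic method that fuses gradient-weighted attention, adaptive layer propagation, and relevance-weighted token aggregation into a single cross-modal explanation.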

Findings

01 Outperforms prior methods in faithfulness and human-attention alignment.

02 Enables detailed analysis of model reasoning and biases.

03 Facilitates diagnosis of hallucination and systematic errors.

Abstract

Recent large vision-language models (LVLMs) have advanced capabilities in visual question answering (VQA). However, interpreting where LVLMs direct their visual attention remains a significant challenge, even though it is essential for understanding model behavior. We introduce GLIMPSE (Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation), a lightweight, model-agnostic framework that jointly attributes LVLM outputs to the most relevant visual evidence and textual signals that support open-ended generation. GLIMPSE fuses gradient-weighted attention, adaptive layer propagation, and relevance-weighted token aggregation to produce holistic response-level heat maps for interpreting cross-modal reasoning, outperforming prior methods in faithfulness and advancing the state of the art in human-attention alignment. We demonstrate an analytic approach to uncover fine-grained insights…
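
The three ingredients named in the abstract can be illustrated with a rough NumPy sketch. This is not the paper's formulation: the positive-part gradient weighting, the gradient-norm layer weights, the rollout-style propagation, and every function name below are assumptions chosen for illustration, and the attention and gradient tensors are assumed to have been extracted from a model beforehand with a fixed sequence length.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero


def gradient_weighted_attention(attn, grad):
    """Weight attention by its gradient, keep positive evidence, average heads.

    attn, grad: arrays of shape (heads, seq, seq).
    """
    return np.maximum(attn * grad, 0.0).mean(axis=0)


def layer_weights_from_gradients(grad):
    """Hypothetical adaptive layer weights: normalized gradient norm per layer.

    grad: array of shape (layers, heads, seq, seq).
    """
    norms = np.linalg.norm(grad.reshape(grad.shape[0], -1), axis=1)
    return norms / (norms.sum() + EPS)


def propagate_layers(layer_maps, layer_weights):
    """Rollout-style propagation; the identity term stands in for residuals."""
    seq = layer_maps[0].shape[0]
    relevance = np.eye(seq)
    for layer_map, weight in zip(layer_maps, layer_weights):
        mixed = np.eye(seq) + weight * layer_map
        mixed /= mixed.sum(axis=-1, keepdims=True)  # keep rows stochastic
        relevance = mixed @ relevance
    return relevance


def response_heatmap(attn, grad, image_idx, token_weights):
    """Aggregate per-token image relevance into one response-level map.

    attn, grad: arrays of shape (tokens, layers, heads, seq, seq).
    image_idx: sequence positions that hold image patches.
    token_weights: one non-negative relevance weight per generated token.
    """
    token_weights = np.asarray(token_weights, dtype=float)
    token_weights = token_weights / (token_weights.sum() + EPS)
    heatmap = np.zeros(len(image_idx))
    for t in range(attn.shape[0]):
        maps = [gradient_weighted_attention(a, g)
                for a, g in zip(attn[t], grad[t])]
        relevance = propagate_layers(maps, layer_weights_from_gradients(grad[t]))
        # The last row is the query position that produced generated token t.
        image_relevance = relevance[-1, image_idx]
        heatmap += token_weights[t] * image_relevance / (image_relevance.sum() + EPS)
    return heatmap
```

The returned vector has one entry per image patch and can be reshaped to the patch grid (for example 4 x 4 for 16 patches) and upsampled for overlay on the input image. Because each per-token map is normalized before the weighted sum, the token weights alone decide how much each generated token contributes to the final map.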
