Preserving Localized Patch Semantics in VLMs
Parsa Esmaeilkhani, Longin Jan Latecki

TL;DR
This paper introduces Logit Lens Loss (LLL), a method to preserve localized visual semantics in vision-language models, enhancing explainability and improving vision tasks without architectural changes.
Contribution
The paper proposes LLL, a novel loss that maintains visual token locality in VLMs, enabling better visualization and performance on vision-centric tasks.
Findings
LLL produces meaningful object confidence maps.
LLL improves segmentation performance.
Visual token locality is preserved with LLL.
Abstract
Logit Lens has been proposed for visualizing tokens that contribute most to LLM answers. Recently, Logit Lens was also shown to be applicable in autoregressive Vision-Language Models (VLMs), where it illustrates the conceptual content of image tokens in the form of heatmaps, e.g., which image tokens are likely to depict the concept of cat in a given image. However, the visual content of image tokens often gets diffused to language tokens, and consequently, the locality of visual information gets mostly destroyed, which renders Logit Lens visualization unusable for explainability. To address this issue, we introduce a complementary loss to next-token prediction (NTP) to prevent the visual tokens from losing the visual representation inherited from corresponding image patches. The proposed Logit Lens Loss (LLL) is designed to make visual token embeddings more semantically aligned with the…
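The heatmap idea described in the abstract can be illustrated with a minimal sketch. It covers only the Logit Lens readout, not the proposed loss, whose definition is not given in the text above. The sketch assumes each image token's hidden state is projected through the language model's unembedding matrix and that the softmax probability of one concept token is arranged on the patch grid. The function name `logit_lens_heatmap`, the toy four-word vocabulary, and the 3x4 patch grid are hypothetical and chosen purely for illustration.

```python
import numpy as np

def logit_lens_heatmap(hidden, unembed, token_id, grid_hw):
    """Probability of one vocabulary token at every image-token position.

    hidden:   (N, d) hidden states of the N image tokens (assumed input)
    unembed:  (V, d) unembedding matrix mapping hidden states to vocab logits
    token_id: index of the concept token (e.g. "cat") in the vocabulary
    grid_hw:  (H, W) patch grid with H * W == N
    """
    h, w = grid_hw
    if hidden.shape[0] != h * w:
        raise ValueError("number of image tokens must equal H * W")
    logits = hidden @ unembed.T                      # (N, V)
    logits -= logits.max(axis=1, keepdims=True)      # numerically stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, token_id].reshape(h, w)          # (H, W) heatmap

# Toy data (hypothetical): 4-word vocabulary with one-hot unembedding rows,
# and a 3x4 grid of image tokens whose hidden states are all zero except one.
unembed = np.eye(4)
hidden = np.zeros((12, 4))
hidden[5, 2] = 8.0          # patch 5 points strongly toward vocabulary token 2

heatmap = logit_lens_heatmap(hidden, unembed, token_id=2, grid_hw=(3, 4))
peak = np.unravel_index(heatmap.argmax(), heatmap.shape)
print(peak)                 # grid location of the most confident patch
```

In this toy setup the only patch carrying information about token 2 is the one that lights up, which is the kind of locality the abstract says tends to be lost when visual content diffuses into language tokens.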
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
