Direct Visual Grounding by Directing Attention of Visual Tokens
Parsa Esmaeilkhani, Longin Jan Latecki

TL;DR
This paper introduces a novel loss function that directly supervises visual token attention in vision-language models, significantly improving their performance on various visual grounding tasks without requiring additional labels.
Contribution
The authors propose the KL attention loss (KLAL), a Kullback-Leibler divergence loss that aligns the attention placed on visual tokens with ground-truth attention maps, enhancing VLMs' ability to focus on relevant visual information.
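To make the idea of supervising attention with a KL term concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The direction of the divergence (target relative to attention), the epsilon smoothing, the function names `normalize`, `kl_attention_loss`, and `total_loss`, and the weighted sum with the next-token prediction loss are all assumptions made for illustration.

```python
import math

def normalize(weights, eps=1e-8):
    """Rescale non-negative weights into a distribution (eps avoids log(0))."""
    smoothed = [w + eps for w in weights]
    total = sum(smoothed)
    return [w / total for w in smoothed]

def kl_attention_loss(attn_over_visual, target_map, eps=1e-8):
    """KL(target || attention) over visual-token positions.

    attn_over_visual: attention weights from one answer token to each
        visual token (hypothetical input, e.g. averaged over heads).
    target_map: ground-truth relevance of each visual token.
    The KL direction is an assumption, not taken from the paper.
    """
    if len(attn_over_visual) != len(target_map):
        raise ValueError("attention and target must cover the same visual tokens")
    p = normalize(target_map, eps)        # target distribution
    q = normalize(attn_over_visual, eps)  # model attention distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def total_loss(ntp_loss, attn_loss, weight=1.0):
    """Assumed combination: NTP loss plus a weighted attention term."""
    return ntp_loss + weight * attn_loss

# Toy example: four visual tokens, the last two are relevant to the query.
target = [0.0, 0.0, 1.0, 1.0]
focused = [0.05, 0.05, 0.45, 0.45]    # attention mostly on relevant tokens
unfocused = [0.45, 0.45, 0.05, 0.05]  # attention mostly elsewhere
print(kl_attention_loss(focused, target) < kl_attention_loss(unfocused, target))
```

In this toy case the loss is smaller when attention concentrates on the relevant visual tokens, which is the behavior such a term is meant to reward during training.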
Findings
Improved accuracy on geometric and grounding tasks.
Enhanced attention to relevant visual tokens during answer generation.
Commercial VLMs perform poorly on line tracing tasks.
Abstract
Vision Language Models (VLMs) mix visual tokens and text tokens. A puzzling issue is that the visual tokens most related to the query receive little to no attention from the answer tokens in the final layers of the LLM module of VLMs, where the attention layers treat all tokens, visual and language alike, equally. This may result in wrong answers to visual questions, as our experimental results confirm. It appears that the standard next-token prediction (NTP) loss provides an insufficient signal for directing attention to visual tokens. We hypothesize that more direct supervision of the attention of visual tokens to corresponding language tokens in the LLM module of VLMs will lead to improved performance on visual tasks. To demonstrate that this is indeed the case, we propose a novel loss function that directly supervises the attention of visual…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
