Direct Visual Grounding by Directing Attention of Visual Tokens

Parsa Esmaeilkhani; Longin Jan Latecki

arXiv:2511.12738 · cs.CV · November 18, 2025

TL;DR

This paper introduces a novel loss function that directly supervises visual token attention in vision-language models, significantly improving their performance on various visual grounding tasks without requiring additional labels.

Contribution

The authors propose KL attention loss (KLAL), a new method that aligns visual token attention with ground truth maps, enhancing VLMs' ability to focus on relevant visual information.
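
As a rough illustration of the idea, the sketch below computes a KL divergence between a ground truth map and the attention that one answer token places on the visual tokens. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the KL direction (target relative to attention), the renormalization over visual tokens only, and the toy four-token example are all hypothetical choices made here for illustration.

```python
import math


def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def kl_attention_loss(attn_to_visual, target_map, eps=1e-12):
    """KL(target || attention) over the visual tokens (assumed direction).

    attn_to_visual: attention weights from one answer token to each visual
                    token (hypothetical input; need not sum to 1).
    target_map:     ground truth relevance of each visual token, e.g. 1 for
                    tokens inside an annotated region and 0 elsewhere.
    """
    if len(attn_to_visual) != len(target_map):
        raise ValueError("attention and target must cover the same tokens")
    p = normalize(target_map)       # target distribution
    q = normalize(attn_to_visual)   # attention renormalized over visual tokens
    # Terms with p_i == 0 contribute nothing; eps guards against log(x / 0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Toy example: four visual tokens, the middle two are relevant to the query.
target = [0.0, 1.0, 1.0, 0.0]
uniform_attention = [0.1, 0.1, 0.1, 0.1]   # attention spread evenly
focused_attention = [0.0, 0.5, 0.5, 0.0]   # attention on the relevant tokens

loss_uniform = kl_attention_loss(uniform_attention, target)
loss_focused = kl_attention_loss(focused_attention, target)
```

In this toy case the loss is positive when attention is spread evenly and zero when it matches the target, which is the kind of signal such a loss would add on top of next-token prediction.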

Findings

01. Improved accuracy on geometric and grounding tasks.

02. Enhanced attention to relevant visual tokens during answer generation.

03. Commercial VLMs perform poorly on line tracing tasks.

Abstract

Vision Language Models (VLMs) mix visual tokens and text tokens. A puzzling issue is that the visual tokens most related to the query receive little to no attention from the answer tokens in the final layers of the LLM module of VLMs, where the attention layers treat all tokens, visual and language alike, equally. This may result in wrong answers to visual questions, as our experimental results confirm. It appears that the standard next-token prediction (NTP) loss provides an insufficient signal for directing attention to visual tokens. We hypothesize that more direct supervision of the attention of visual tokens to corresponding language tokens in the LLM module of VLMs will lead to improved performance on visual tasks. To demonstrate that this is indeed the case, we propose a novel loss function that directly supervises the attention of visual…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications