Reasoning to Attend: Try to Understand How <SEG> Token Works
Rui Qian, Xin Yin, Dejing Dou

TL;DR
This paper investigates the role of <SEG> tokens in large multimodal models, visualizes their semantic similarity responses, and proposes READ, a method to enhance reasoning by leveraging similarity maps for better object localization.
Contribution
The paper provides the first visualization of <SEG> token activations, revealing their role in semantic matching, and introduces READ, a novel approach that improves reasoning in multimodal models using similarity-guided points.
Findings
<SEG> tokens focus on semantic similarity within image-text pairs.
READ improves reasoning and localization capabilities.
The approach maintains performance without catastrophic forgetting.
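The summary above mentions similarity-guided points but does not say how they are chosen. As a rough illustration only, the sketch below takes the highest-response cells of a similarity map as candidate points. The function name `top_k_points`, the top-k rule, and the toy values are all assumptions made for this example and are not taken from the paper.

```python
def top_k_points(sim_map, k):
    """Return the k (row, col) cells with the highest similarity, best first.

    Hypothetical helper: a plain top-k rule, not necessarily how READ selects points.
    """
    cells = [
        (score, r, c)
        for r, row in enumerate(sim_map)
        for c, score in enumerate(row)
    ]
    # Sort by score only, highest first; ties keep row-major order.
    cells.sort(key=lambda cell: cell[0], reverse=True)
    return [(r, c) for _, r, c in cells[:k]]


# Toy 2x3 similarity map with made-up values.
sim_map = [
    [0.1, 0.9, 0.2],
    [0.3, 0.8, 0.0],
]
points = top_k_points(sim_map, 2)
```

In a pipeline like the one described, points of this kind could be passed to a promptable segmenter as location hints, though the exact coupling is not specified here.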
Abstract
Current Large Multimodal Models (LMMs) empowered visual grounding typically rely on <SEG> tokens as a text prompt to jointly optimize the vision-language model (e.g., LLaVA) and the downstream task-specific model (e.g., SAM). However, we observe that little research has looked into how it works. In this work, we first visualize the similarity maps, which are obtained by computing the semantic similarity between the <SEG> token and the image token embeddings derived from the last hidden layer in both the LLaVA encoder and SAM decoder. Intriguingly, we have found that a striking consistency holds in terms of activation responses in the similarity map, which reveals that what the <SEG> token contributes to is semantic similarity within image-text pairs. Specifically, the <SEG> token, a placeholder expanded in text vocabulary, extensively queries…
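To make the similarity-map idea concrete, here is a minimal sketch that scores each image token embedding against a single <SEG> embedding and reshapes the scores into a grid. The abstract only says "semantic similarity", so the use of cosine similarity is an assumption, as are the function names, the row-major token layout, and the toy vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_map(seg_embedding, image_embeddings, height, width):
    """Score every image token against the <SEG> embedding.

    Assumes image_embeddings is a row-major list of height * width vectors.
    Cosine similarity is an illustrative choice, not confirmed by the source.
    """
    if len(image_embeddings) != height * width:
        raise ValueError("expected height * width image token embeddings")
    scores = [cosine(seg_embedding, token) for token in image_embeddings]
    # Reshape the flat score list into a height x width grid.
    return [scores[r * width:(r + 1) * width] for r in range(height)]


# Toy example: a 2x2 grid of image tokens with 2-dimensional, made-up embeddings.
seg = [1.0, 0.0]
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
sim = similarity_map(seg, tokens, 2, 2)
```

In practice the embeddings would come from the model's last hidden layer and the grid would be upsampled to image resolution for visualization; both steps are omitted here.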
Taxonomy
Topics: Ethics and Social Impacts of AI
Methods: Segment Anything Model
