Reasoning to Attend: Try to Understand How <SEG> Token Works

Rui Qian; Xin Yin; Dejing Dou

arXiv:2412.17741 · cs.CV · March 17, 2026

TL;DR

This paper investigates the role of <SEG> tokens in large multimodal models, visualizes their semantic similarity responses, and proposes READ, a method to enhance reasoning by leveraging similarity maps for better object localization.

Contribution

The paper provides the first visualization of <SEG> token activations, revealing their role in semantic matching, and introduces READ, a novel approach that improves reasoning in multimodal models using similarity-guided points.
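One plausible reading of "similarity-guided points" is that the strongest responses in a similarity map are turned into point prompts for the segmentation model. The sketch below illustrates only that reading. The top-k selection rule, the function name `top_k_points`, and the `patch_size` parameter are assumptions made for illustration; the text here does not confirm that READ selects points this way.

```python
import numpy as np

def top_k_points(sim_map, k, patch_size):
    """Return the k highest-scoring grid cells as (x, y) pixel coordinates.

    Hypothetical helper: top-k selection is an assumption of this sketch,
    not a procedure stated in the source.
    """
    h, w = sim_map.shape
    # Indices of the flattened map, sorted from highest to lowest score.
    flat_idx = np.argsort(sim_map, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat_idx, (h, w))
    # Map each grid cell to the centre of its image patch.
    xs = (cols + 0.5) * patch_size
    ys = (rows + 0.5) * patch_size
    return np.stack([xs, ys], axis=1)

# Toy 3x3 similarity map with two clearly dominant cells.
sim_map = np.array([[0.1, 0.2, 0.1],
                    [0.3, 0.9, 0.4],
                    [0.1, 0.8, 0.2]])
points = top_k_points(sim_map, k=2, patch_size=16)
```

With a 16-pixel patch, the two strongest cells map to the patch centres (24, 24) and (24, 40), which could then be passed to a point-promptable segmenter.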

Findings

01. <SEG> tokens focus on semantic similarity within image-text pairs.

02. READ improves reasoning and localization capabilities.

03. The approach maintains performance without catastrophic forgetting.

Abstract

Current visual grounding methods empowered by Large Multimodal Models (LMMs) typically rely on <SEG> tokens as a text prompt to jointly optimize the vision-language model (e.g., LLaVA) and the downstream task-specific model (e.g., SAM). However, we observe that little research has looked into how this token works. In this work, we first visualize the similarity maps, which are obtained by computing the semantic similarity between the <SEG> token and the image token embeddings derived from the last hidden layer in both the LLaVA encoder and SAM decoder. Intriguingly, we have found that a striking consistency holds in terms of activation responses in the similarity map, which reveals that what the <SEG> token contributes to is semantic similarity within image-text pairs. Specifically, the <SEG> token, a placeholder expanded in text vocabulary, extensively queries…
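The similarity map described in the abstract can be illustrated with a minimal sketch. It assumes cosine similarity as the measure (the abstract says only "semantic similarity"), a single <SEG> embedding of dimension `dim`, and image token embeddings laid out on an `h` by `w` grid. The function name `similarity_map` and the random toy data are hypothetical and stand in for the last-hidden-layer embeddings of a real model.

```python
import numpy as np

def similarity_map(seg_embedding, image_embeddings, grid_hw):
    """Cosine similarity between one token embedding and every image token.

    seg_embedding:    (dim,) vector standing in for the <SEG> token.
    image_embeddings: (h * w, dim) matrix of image token embeddings.
    grid_hw:          (h, w) layout of the image tokens.
    Cosine similarity is an assumption of this sketch.
    """
    h, w = grid_hw
    eps = 1e-8  # guards against division by zero
    seg = seg_embedding / (np.linalg.norm(seg_embedding) + eps)
    norms = np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    img = image_embeddings / (norms + eps)
    sims = img @ seg  # one score per image token, shape (h * w,)
    return sims.reshape(h, w)

# Toy data: the <SEG> stand-in is a slightly noisy copy of image token 5.
rng = np.random.default_rng(0)
dim, h, w = 16, 4, 4
image_embeddings = rng.normal(size=(h * w, dim))
seg_embedding = image_embeddings[5] + 0.01 * rng.normal(size=dim)

sim = similarity_map(seg_embedding, image_embeddings, (h, w))
peak = np.unravel_index(np.argmax(sim), sim.shape)
```

In this toy setup the map peaks at the grid cell of token 5, row 1 and column 1, which mirrors the idea that the strongest activation marks the image region most similar to the query token.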

Code & Models

Repositories

rui-qian/read (PyTorch, official)

Taxonomy

Topics: Ethics and Social Impacts of AI

Methods: Segment Anything Model