Grounding Everything in Tokens for Multimodal Large Language Models
Xiangxuan Ren, Zhongdao Wang, Liping Hou, Pin Tang, Guoqing Wang, Chao Ma

TL;DR
GETok introduces a novel token-based spatial grounding method for multimodal large language models, enhancing their ability to localize objects in 2D space without altering the core architecture.
Contribution
GETok proposes a new tokenization approach that embeds spatial relationships directly into tokens, improving 2D grounding in MLLMs without changing the autoregressive model structure.
Findings
GETok outperforms state-of-the-art methods on referring tasks.
It enables precise iterative localization of objects in 2D images.
GETok improves spatial reasoning in MLLMs in both supervised and reinforcement learning settings.
Abstract
Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requires tokenization of input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can sequential language tokens be improved to better ground objects in 2D space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs…
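The grid-then-offset idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the grid size, the token names (`<grid_r_c>`, `<off_dx_dy>`), the nine-way offset vocabulary, and the rule that each refinement step shrinks by a factor of three are all assumptions made for illustration. In the paper the tokens are learnable vocabulary entries predicted by the model, whereas here they are computed from a known point purely to show how a coarse anchor plus a few discrete offsets can pin down a location.

```python
from typing import List, Tuple

# All constants and token names below are hypothetical, chosen for illustration.
GRID_SIZE = 8  # assumed number of grid cells per image side

# Assumed vocabularies: one token per grid cell, nine discrete offset moves.
GRID_TOKENS = {
    f"<grid_{r}_{c}>": (r, c)
    for r in range(GRID_SIZE)
    for c in range(GRID_SIZE)
}
OFFSET_TOKENS = {
    f"<off_{dx}_{dy}>": (dx, dy)
    for dx in (-1, 0, 1)
    for dy in (-1, 0, 1)
}


def _clamp_round(v: float) -> int:
    """Round to the nearest integer and clamp to the offset range [-1, 1]."""
    return max(-1, min(1, round(v)))


def encode_point(x: float, y: float, refine_steps: int = 2) -> List[str]:
    """Encode a normalized point (x, y in [0, 1]) as one grid token
    followed by `refine_steps` offset tokens."""
    col = min(int(x * GRID_SIZE), GRID_SIZE - 1)
    row = min(int(y * GRID_SIZE), GRID_SIZE - 1)
    tokens = [f"<grid_{row}_{col}>"]

    # Start at the cell center (the coarse anchor) and refine iteratively.
    cx, cy = (col + 0.5) / GRID_SIZE, (row + 0.5) / GRID_SIZE
    step = 1.0 / (3 * GRID_SIZE)  # a third of a cell, so three moves tile it
    for _ in range(refine_steps):
        dx = _clamp_round((x - cx) / step)
        dy = _clamp_round((y - cy) / step)
        tokens.append(f"<off_{dx}_{dy}>")
        cx, cy = cx + dx * step, cy + dy * step
        step /= 3  # each refinement works at a finer scale
    return tokens


def decode_point(tokens: List[str]) -> Tuple[float, float]:
    """Invert encode_point: replay the anchor and offsets to recover (x, y)."""
    row, col = GRID_TOKENS[tokens[0]]
    cx, cy = (col + 0.5) / GRID_SIZE, (row + 0.5) / GRID_SIZE
    step = 1.0 / (3 * GRID_SIZE)
    for tok in tokens[1:]:
        dx, dy = OFFSET_TOKENS[tok]
        cx, cy = cx + dx * step, cy + dy * step
        step /= 3
    return cx, cy


if __name__ == "__main__":
    toks = encode_point(0.30, 0.72, refine_steps=2)
    print(toks, decode_point(toks))
```

Under these assumptions, each extra offset token cuts the worst-case localization error per axis by a factor of three, from half a cell with the grid token alone to `1 / (2 * GRID_SIZE * 3**k)` after `k` offsets, which is one way to read "precise and iterative refinement" in a purely sequential token stream.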
