MolmoPoint: Better Pointing for VLMs with Grounding Tokens
Christopher Clark, Yue Yang, Jae Sung Park, Zixian Ma, Jieyu Zhang, Rohun Tripathi, Mohammadreza Salehi, Sangho Lee, Taira Anderson, Winson Han, Ranjay Krishna

TL;DR
MolmoPoint introduces an intuitive visual pointing mechanism for vision-language models, replacing coordinate-based methods with direct token selection, leading to state-of-the-art results in image, GUI, and video pointing tasks.
Contribution
The paper proposes a novel token-based pointing method for VLMs that improves accuracy, efficiency, and interpretability over traditional coordinate-based approaches.
Findings
Achieved 70.7% on PointBench for image pointing
Set a new state of the art on ScreenSpotPro for GUI pointing, at 61.1%
Improved video pointing and tracking performance significantly
Abstract
Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the…
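To make the coarse-to-fine selection described in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes the pointing token picks a visual token by highest dot-product score, and that the subpatch and location tokens are row-major indices into progressively finer grids. The function names (`select_index`, `cell_origin`, `decode_point`) and every grid size are hypothetical and chosen only for illustration. The abstract does not specify how the scores or the within-subpatch location are computed.

```python
import numpy as np

def select_index(query, keys):
    """Return the index of the key with the highest dot-product score.

    Assumption: a plain dot product stands in for the attention between
    the pointing token (query) and the visual tokens (keys).
    """
    scores = keys @ query  # shape: (num_keys,)
    return int(np.argmax(scores))

def cell_origin(index, grid_w, cell_size):
    """Top-left (x, y) of the cell at a row-major index in a grid with grid_w columns."""
    row, col = divmod(index, grid_w)
    return col * cell_size, row * cell_size

def decode_point(patch_idx, subpatch_idx, loc_idx,
                 grid_w=4, patch_size=16, sub_grid=2, loc_grid=4):
    """Turn three selected indices into pixel coordinates.

    All grid sizes are illustrative assumptions: a grid_w-column grid of
    patch_size-pixel patches, each split into sub_grid x sub_grid subpatches,
    each of those split into loc_grid x loc_grid location cells.
    """
    px, py = cell_origin(patch_idx, grid_w, patch_size)
    sub_size = patch_size / sub_grid
    sx, sy = cell_origin(subpatch_idx, sub_grid, sub_size)
    loc_size = sub_size / loc_grid
    lx, ly = cell_origin(loc_idx, loc_grid, loc_size)
    # Report the center of the finest cell that was selected.
    return px + sx + lx + loc_size / 2, py + sy + ly + loc_size / 2

# Usage: three hypothetical visual tokens and one pointing-token query.
keys = np.eye(3)
query = np.array([0.1, 0.9, 0.2])
patch = select_index(query, keys)  # picks the token the query aligns with best
x, y = decode_point(patch_idx=5, subpatch_idx=3, loc_idx=0)
```

Under these assumptions a point costs three selection steps regardless of image resolution, whereas a coordinate written out as text takes several tokens per number.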
