MolmoPoint: Better Pointing for VLMs with Grounding Tokens

Christopher Clark; Yue Yang; Jae Sung Park; Zixian Ma; Jieyu Zhang; Rohun Tripathi; Mohammadreza Salehi; Sangho Lee; Taira Anderson; Winson Han; Ranjay Krishna

arXiv:2603.28069·cs.CV·March 31, 2026

TL;DR

MolmoPoint introduces an intuitive visual pointing mechanism for vision-language models, replacing coordinate-based methods with direct token selection, leading to state-of-the-art results in image, GUI, and video pointing tasks.

Contribution

The paper proposes a novel token-based pointing method for VLMs that improves accuracy, efficiency, and interpretability over traditional coordinate-based approaches.

Findings

01 Achieved 70.7% on PointBench for image pointing

02 Set a new state of the art on ScreenSpotPro for GUI pointing with 61.1%

03 Improved video pointing and tracking performance significantly

Abstract

Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the…
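The coarse-to-fine selection described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the dot-product scoring, the 14-pixel patch size, the 2x2 subpatch grid, the normalized in-subpatch offset, and the names `argmax_dot` and `point_to_pixel` are all assumptions chosen for illustration. It only shows how three successive selections (patch, subpatch, location) could resolve to a pixel coordinate.

```python
from typing import Sequence, Tuple


def argmax_dot(query: Sequence[float], keys: Sequence[Sequence[float]]) -> int:
    """Return the index of the key with the largest dot product with the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return max(range(len(scores)), key=scores.__getitem__)


def point_to_pixel(
    patch_index: int,
    subpatch_index: int,
    offset: Tuple[float, float],
    grid_width: int,
    patch_size: int = 14,  # assumed patch size in pixels (hypothetical)
    sub_grid: int = 2,     # assumed subpatches per patch side (hypothetical)
) -> Tuple[float, float]:
    """Combine patch, subpatch, and in-subpatch offset into pixel (x, y)."""
    patch_row, patch_col = divmod(patch_index, grid_width)
    sub_row, sub_col = divmod(subpatch_index, sub_grid)
    sub_size = patch_size / sub_grid
    # offset is a normalized (x, y) position in [0, 1] inside the subpatch.
    x = patch_col * patch_size + sub_col * sub_size + offset[0] * sub_size
    y = patch_row * patch_size + sub_row * sub_size + offset[1] * sub_size
    return x, y


# Toy input: a 2x3 grid of visual tokens, each with a 2-dimensional key.
patch_keys = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.1],
              [0.0, 0.9], [0.8, 0.1], [0.2, 0.2]]
point_query = [1.0, 0.0]     # stand-in for the pointing token's query
patch = argmax_dot(point_query, patch_keys)

# Second token: choose one of the 2x2 subpatches inside the selected patch.
subpatch_keys = [[0.0, 0.1], [0.2, 0.0], [0.9, 0.0], [0.1, 0.1]]
subpatch_query = [1.0, 0.0]  # stand-in for the subpatch token's query
subpatch = argmax_dot(subpatch_query, subpatch_keys)

# Third token: a location inside the subpatch, modeled here as its center.
x, y = point_to_pixel(patch, subpatch, offset=(0.5, 0.5), grid_width=3)
print(patch, subpatch, (x, y))
```

In this sketch the point is expressed as three discrete or near-discrete choices rather than as a string of coordinate digits, which is the contrast with coordinate-based pointing that the abstract draws.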

Code & Models

Datasets

allenai/MolmoPoint-GUISyn
dataset · 3.5k dl