VideoMolmo: Spatio-Temporal Grounding Meets Pointing
Ghazi Shazan Ahmad, Ahmed Heakl, Hanan Gani, Abdelrahman Shaker, Zhiqiang Shen, Fahad Shahbaz Khan, Salman Khan

TL;DR
VideoMolmo is a multimodal model for text-conditioned spatio-temporal pointing in videos. It combines language-guided reasoning with a temporal module that keeps predictions consistent across frames, is accompanied by a new dataset, and improves pointing accuracy and reasoning across diverse real-world scenarios.
Contribution
We introduce VideoMolmo, a large multimodal model with a temporal module and mask fusion pipeline, along with a new dataset and benchmark for spatio-temporal pointing conditioned on text.
Findings
Outperforms existing models in spatio-temporal pointing accuracy.
Demonstrates strong generalization on out-of-distribution benchmarks.
Provides a new dataset and benchmark for future research.
Abstract
Spatio-temporal localization is vital for precise interactions across diverse domains, from biological research to autonomous navigation and interactive interfaces. Current video-based approaches, while proficient in tracking, lack the sophisticated reasoning capabilities of large language models, limiting their contextual understanding and generalization. We introduce VideoMolmo, a large multimodal model tailored for fine-grained spatio-temporal pointing conditioned on textual descriptions. Building upon the Molmo architecture, VideoMolmo incorporates a temporal module utilizing an attention mechanism to condition each frame on preceding frames, ensuring temporal consistency. Additionally, our novel temporal mask fusion pipeline employs SAM2 for bidirectional point propagation, significantly enhancing coherence across video sequences. This two-step decomposition, i.e., first using the…
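The temporal module described in the abstract conditions each frame on the frames before it through attention. The following is only a minimal sketch of that general idea, not the authors' implementation. It assumes one feature vector per frame (the real model presumably works on patch-level tokens), a fixed look-back `window`, scaled dot-product attention, and a residual sum; the function name `condition_on_preceding` and all of these choices are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def condition_on_preceding(frames, window=2):
    """Condition each frame feature on up to `window` preceding frames.

    Assumption: `frames` is a list of equal-length feature vectors, one per
    frame. The current frame acts as the query; preceding frames act as both
    keys and values. The attended context is added back residually.
    """
    conditioned = []
    for t, query in enumerate(frames):
        context = frames[max(0, t - window):t]
        if not context:
            # The first frame has no history, so it passes through unchanged.
            conditioned.append(list(query))
            continue
        scale = math.sqrt(len(query))
        weights = softmax([dot(query, key) / scale for key in context])
        attended = [
            sum(w * value[i] for w, value in zip(weights, context))
            for i in range(len(query))
        ]
        conditioned.append([q + a for q, a in zip(query, attended)])
    return conditioned


if __name__ == "__main__":
    toy_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for t, feature in enumerate(condition_on_preceding(toy_frames)):
        print(t, feature)
```

Because each frame attends only to earlier frames, changing a later frame never alters the output for an earlier one, which is the causal property the abstract's "preceding frames" wording suggests.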
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Leveraging of pretrained segmentation models to generate instance masks based on point prompts
- Strong performance against baselines
- The assumption that points make it clear what the model means is wrong. The Segment Anything Model already shows that if one points at an eye, the pointing is ambiguous: it could be the eye, the head, or the whole body that is pointed at.
- Limited novelty: The paper uses known object-centric principles, such as disentangling object appearance from position. In this case, position is disentangled from the downstream task of mask generation. Additionally, the temporal averaging is a simple mea
1. The paper is well-written and easy to follow.
2. Many quantitative experiments have been conducted to verify that VideoMolmo outperforms most previous state-of-the-art methods across various downstream tasks.
1. One concern for this work is the motivation of the point-based grounding formulation. As mentioned in the manuscript, the point-level supervision is constructed from mask-level data, and it is unclear why the point-based formulation would be better than other data formats, such as masks, for visual grounding in videos. I think mask-based annotations can also transfer to various other forms like points, bounding boxes, etc. So the rationality of the point-based formulation of this work
- The proposed approach outperforms previous approaches on multiple datasets and multiple metrics.
- The evaluation benchmark is exhaustive on multiple aspects of spatio-temporal evaluation.
- Architecture Novelty
- Section 4.1: Temporal Module: Aggregating information from past-frame features is a very common aspect of trivial video understanding. For video segmentation or dense tasks the understanding from frames to patch level aggregation. Earlier works utilize temporal feature aggregation or a memory module; the idea is a base setup, not a novelty. If there’s something missing I would like the authors to clarify. Specifically, if there’s some previous work used as a ba
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robot Manipulation and Learning
Methods: Softmax · Attention Is All You Need · VOS
