SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs

Mohamad Alansari; Naufal Suryanto; Divya Velayudhan; Sajid Javed; Naoufel Werghi; Muzammal Naseer

arXiv:2603.12382 · cs.CV · March 16, 2026

TL;DR

SPARROW is a pixel-grounded video MLLM that improves spatial precision and temporal stability by injecting target-specific tracked features during training and decoding dual box and segmentation prompts, with reported gains of up to +8.9 J&F on RVOS and 5 mIoU on visual grounding.

Contribution

The paper introduces SPARROW, which combines Target-Specific Tracked Features (TSF) with a dual-prompt ([BOX] and [SEG]) design to improve pixel-level grounding in videos, supported by a large curated dataset.

Findings

1. Achieves gains of up to +8.9 J&F on referring video object segmentation (RVOS) benchmarks.

2. Improves visual grounding by 5 mIoU.

3. Enhances temporal coherence and spatial accuracy in pixel-grounded video understanding.
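
For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how overlap scores of this kind are commonly computed. It is not the paper's evaluation code. J&F conventionally averages region similarity J (mask IoU) with contour accuracy F; only J is sketched here, and F is omitted. The helper names (`mask_iou`, `box_iou`, `mean`), the list-of-lists mask format, the (x1, y1, x2, y2) box format, and the empty-mask convention are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]                   # binary mask as rows of 0/1 (assumed format)
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corners (assumed format)


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Region similarity J for one frame: intersection over union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return 0.0 if union <= 0 else inter / union


def mean(values: Sequence[float]) -> float:
    """Arithmetic mean, used to aggregate per-frame J or per-sample IoU into mIoU."""
    return sum(values) / len(values)


# Toy two-frame video: the first frame overlaps partially, the second exactly.
pred_masks = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
gt_masks = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
video_j = mean([mask_iou(p, g) for p, g in zip(pred_masks, gt_masks)])  # 0.75
```

Averaging per-frame scores as above is one common aggregation; benchmarks differ on whether they average per frame, per object, or per video, so the exact protocol behind the reported numbers should be taken from the paper itself.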

Abstract

Multimodal large language models (MLLMs) have advanced from image-level reasoning to pixel-level grounding, but extending these capabilities to videos remains challenging as models must achieve spatial precision and temporally consistent reference tracking. Existing video MLLMs often rely on a static segmentation token ([SEG]) for frame-wise grounding, which provides semantics but lacks temporal context, causing spatial drift, identity switches, and unstable initialization when objects move or reappear. We introduce SPARROW, a pixel-grounded video MLLM that unifies spatial accuracy and temporal stability through two key components: (i) Target-Specific Tracked Features (TSF), which inject temporally aligned referent cues during training, and (ii) a dual-prompt design that decodes box ([BOX]) and segmentation ([SEG]) tokens to fuse geometric priors with semantic grounding. SPARROW is…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Neural Network Applications