GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding

Rong Fan; Kaiyan Xiao; Minghao Zhu; Liuyi Wang; Kai Dai; Zhao Yang

arXiv:2604.02093·cs.CV·April 3, 2026

TL;DR

GroundVTS introduces a query-guided visual token sampling method for video temporal grounding. By concentrating on the most informative temporal segments of a video, it improves the accuracy of temporal localization.

Contribution

The paper proposes a query-guided token filtering mechanism and a progressive optimization strategy that together improve temporal modeling in Vid-LLMs.

Findings

01 Achieves a 7.7-point higher mIoU in moment retrieval.

02 Attains a 12.0-point higher mAP in highlight detection.

03 Outperforms existing methods on three VTG benchmarks.
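
For readers unfamiliar with the moment-retrieval metric, the following is a minimal sketch of temporal IoU and its mean (mIoU) over a set of predictions. It assumes each moment is a `(start, end)` pair in seconds; the function names and the sample values are illustrative and are not taken from the paper or its evaluation code.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) time spans."""
    (ps, pe), (gs, ge) = pred, gt
    # Overlap is zero when the spans do not intersect.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average temporal IoU over paired predictions and ground truths."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


# Illustrative values only (not results from the paper).
preds = [(10.0, 20.0), (0.0, 10.0)]
gts = [(15.0, 25.0), (0.0, 10.0)]
score = mean_iou(preds, gts)
```

Here the first pair overlaps for 5 s out of a 15 s union and the second pair matches exactly. Benchmarks commonly report mIoU as a percentage, which is presumably the scale behind the "7.7-point" figure.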

Abstract

Video temporal grounding (VTG) is a critical task in video understanding and a key capability for extending video large language models (Vid-LLMs) to broader applications. However, existing Vid-LLMs rely on uniform frame sampling to extract video information, resulting in a sparse distribution of key frames and the loss of crucial temporal cues. To address this limitation, we propose Grounded Visual Token Sampling (GroundVTS), a Vid-LLM architecture that focuses on the most informative temporal segments. GroundVTS employs a fine-grained, query-guided mechanism to filter visual tokens before feeding them into the LLM, thereby preserving essential spatio-temporal information and maintaining temporal coherence. Furthermore, we introduce a progressive optimization strategy that enables the LLM to effectively adapt to the non-uniform distribution of visual features, enhancing its ability to…
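
The general idea of query-guided token filtering can be illustrated with a small sketch. This is not the GroundVTS implementation: it assumes cosine similarity as the relevance score and a fixed top-k budget, both of which are stand-ins chosen for illustration, and the names `sample_tokens`, `cosine`, and `keep` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def sample_tokens(visual_tokens, query_embedding, keep):
    """Keep the `keep` tokens most similar to the query.

    Assumption: relevance is plain cosine similarity. The kept indices
    are re-sorted so the surviving tokens stay in temporal order.
    """
    scores = [cosine(tok, query_embedding) for tok in visual_tokens]
    ranked = sorted(range(len(visual_tokens)),
                    key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore original (temporal) order
    return kept, [visual_tokens[i] for i in kept]


# Toy 2-D embeddings, one per visual token, in temporal order.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.1]]
query = [0.0, 1.0]
kept_idx, kept_tokens = sample_tokens(tokens, query, keep=2)
```

In this toy input the tokens at positions 2 and 3 align with the query, so they are the ones retained. Unlike uniform sampling, the retained set is non-uniform in time, which is the kind of distribution shift the abstract says the progressive optimization strategy is meant to address.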

Code & Models

Repositories

Florence365/GroundVTS (GitHub)
