TL;DR
GroundVTS introduces a query-guided visual token sampling method for video temporal grounding, significantly improving the accuracy of temporal localization in videos by focusing on informative segments.
Contribution
It proposes a novel query-guided token filtering mechanism and a progressive optimization strategy for better temporal modeling in Vid-LLMs.
Findings
Achieves 7.7-point higher mIoU in moment retrieval.
Attains 12.0-point higher mAP in highlight detection.
Outperforms existing methods on three VTG benchmarks.
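For readers unfamiliar with the moment-retrieval metric above, the following is a minimal sketch of temporal IoU and its mean over a set of queries. It uses the standard definition of interval IoU; the function names and the (start, end)-in-seconds representation are illustrative assumptions, not taken from the paper's code. If mIoU is reported on a 0-100 scale, as is common, a 7.7-point gain corresponds to 0.077 in the fractional units used here.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) time intervals, e.g. in seconds."""
    # Overlap is zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average temporal IoU over paired predictions and ground truths."""
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


if __name__ == "__main__":
    # One perfect prediction and one complete miss.
    print(mean_iou([(0, 10), (0, 10)], [(0, 10), (20, 30)]))
```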
Abstract
Video temporal grounding (VTG) is a critical task in video understanding and a key capability for extending video large language models (Vid-LLMs) to broader applications. However, existing Vid-LLMs rely on uniform frame sampling to extract video information, resulting in a sparse distribution of key frames and the loss of crucial temporal cues. To address this limitation, we propose Grounded Visual Token Sampling (GroundVTS), a Vid-LLM architecture that focuses on the most informative temporal segments. GroundVTS employs a fine-grained, query-guided mechanism to filter visual tokens before feeding them into the LLM, thereby preserving essential spatio-temporal information and maintaining temporal coherence. Furthermore, we introduce a progressive optimization strategy that enables the LLM to effectively adapt to the non-uniform distribution of visual features, enhancing its ability to…
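The general idea of query-guided token filtering can be sketched as follows. This is not the paper's implementation: the abstract does not say how tokens are scored, so cosine similarity against a single query embedding and a fixed top-k budget are assumptions made purely for illustration, as are the names `cosine` and `sample_tokens`. The one property taken from the abstract is that the kept tokens stay in their original temporal order, which results in a non-uniform distribution over frames.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def sample_tokens(tokens, query, keep):
    """Keep the `keep` tokens most similar to the query.

    tokens: list of (frame_index, embedding) pairs in temporal order.
    query:  embedding of the text query (hypothetical; same dimension).
    Returns the selected pairs, still in their original order.
    """
    scores = [cosine(emb, query) for _, emb in tokens]
    # Rank token positions by relevance (assumed scoring rule), best first.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    # Re-sort the survivors by position so temporal order is preserved.
    return [tokens[i] for i in sorted(top)]


if __name__ == "__main__":
    video = [(0, [1.0, 0.0]), (0, [0.0, 1.0]), (1, [0.9, 0.1]), (2, [-1.0, 0.0])]
    kept = sample_tokens(video, query=[1.0, 0.0], keep=2)
    print([frame for frame, _ in kept])
```

Compared with uniform frame sampling, a filter of this kind spends the token budget where the query is most relevant, at the cost of giving the LLM unevenly spaced visual input.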
