Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding
Minseok Kang, Minhyeok Lee, Minjung Kim, Donghyeong Kim, and Sangyoun Lee

TL;DR
This paper introduces DualGround, a dual-branch model that explicitly separates global and local semantics for temporal grounding in videos. It uses structured phrase-level and sentence-level alignment and reports state-of-the-art results on QVHighlights and Charades-STA.
Contribution
The paper proposes a novel dual-branch architecture that disentangles global and local semantics, enhancing fine-grained temporal grounding in video-language tasks.
Findings
DualGround outperforms previous models on QVHighlights and Charades-STA benchmarks.
Explicit semantic separation improves both global and localized video-language alignment.
Structured phrase and sentence-level modeling enhances temporal grounding accuracy.
Abstract
Video Temporal Grounding (VTG) aims to localize temporal segments in long, untrimmed videos that align with a given natural language query. This task typically comprises two subtasks: Moment Retrieval (MR) and Highlight Detection (HD). While recent advances have been driven by powerful pretrained vision-language models such as CLIP and InternVideo2, existing approaches commonly treat all text tokens uniformly during cross-modal attention, disregarding their distinct semantic roles. To validate the limitations of this approach, we conduct controlled experiments demonstrating that VTG models overly rely on [EOS]-driven global semantics while failing to effectively utilize word-level signals, which limits their ability to achieve fine-grained temporal alignment. Motivated by this limitation, we propose DualGround, a dual-branch architecture that explicitly separates global and local…
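To make the global-versus-local distinction concrete, the following is a minimal sketch, not the paper's implementation. It assumes the query is given as a list of token embeddings whose last row is the [EOS] embedding, and that each video clip is a single feature vector. The function names (`split_query_tokens`, `dual_scores`) and the max-over-words pooling are hypothetical choices made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def split_query_tokens(token_embs):
    """Split token embeddings into (word tokens, [EOS] token).

    Assumption: the [EOS] embedding is the last row, with padding already removed.
    """
    if len(token_embs) < 2:
        raise ValueError("need at least one word token plus [EOS]")
    return token_embs[:-1], token_embs[-1]

def dual_scores(clip_feats, token_embs):
    """Return per-clip (global, local) similarity scores.

    global: similarity of each clip to the [EOS] (sentence-level) embedding.
    local:  best similarity of each clip to any single word token.
    """
    words, eos = split_query_tokens(token_embs)
    global_scores = [cosine(c, eos) for c in clip_feats]
    local_scores = [max(cosine(c, w) for w in words) for c in clip_feats]
    return global_scores, local_scores

# Toy data: two clips, one word token, and an [EOS] vector that mixes both directions.
clips = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [1.0, 1.0]]  # last row plays the role of [EOS]
g, l = dual_scores(clips, tokens)
print(g, l)
```

In this toy setup the [EOS] vector scores both clips equally, while the word token separates them. This only illustrates why word-level signals can carry information that a single global embedding does not; the actual DualGround branches are described in the paper.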
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
