Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding

Minseok Kang; Minhyeok Lee; Minjung Kim; Donghyeong Kim; and Sangyoun Lee

arXiv:2510.20244·cs.CV·October 24, 2025

TL;DR

This paper introduces DualGround, a dual-branch model that explicitly separates global and local semantics for improved temporal grounding in videos, achieving state-of-the-art results by leveraging structured phrase and sentence-level alignment.

Contribution

The paper proposes a novel dual-branch architecture that disentangles global and local semantics, enhancing fine-grained temporal grounding in video-language tasks.

Findings

01 DualGround outperforms previous models on the QVHighlights and Charades-STA benchmarks.

02 Explicit semantic separation improves both global and localized video-language alignment.

03 Structured phrase and sentence-level modeling enhances temporal grounding accuracy.

Abstract

Video Temporal Grounding (VTG) aims to localize temporal segments in long, untrimmed videos that align with a given natural language query. This task typically comprises two subtasks: Moment Retrieval (MR) and Highlight Detection (HD). While recent advances have been driven by powerful pretrained vision-language models such as CLIP and InternVideo2, existing approaches commonly treat all text tokens uniformly during cross-modal attention, disregarding their distinct semantic roles. To expose the limitations of this approach, we conduct controlled experiments demonstrating that VTG models rely too heavily on [EOS]-driven global semantics while failing to use word-level signals effectively, which limits their ability to achieve fine-grained temporal alignment. Motivated by this limitation, we propose DualGround, a dual-branch architecture that explicitly separates global and local…
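The separation of [EOS]-driven global semantics from word-level signals described in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: the abstract is truncated here, so the function names (`split_text_tokens`, `global_branch`, `local_branch`), the cosine-similarity scoring, the plain scaled dot-product attention, and the random stand-in features are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def split_text_tokens(text_feats, eos_index):
    """Separate the [EOS] (sentence-level) embedding from the word embeddings.

    Hypothetical helper: text_feats has shape (n_tokens, d).
    """
    global_feat = text_feats[eos_index]                      # (d,)
    local_feats = np.delete(text_feats, eos_index, axis=0)   # (n_tokens - 1, d)
    return global_feat, local_feats


def global_branch(clip_feats, global_feat):
    """Assumed sentence-level branch: cosine similarity of each clip to [EOS]."""
    c = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    g = global_feat / np.linalg.norm(global_feat)
    return c @ g                                             # (T,)


def local_branch(clip_feats, local_feats):
    """Assumed word-level branch: each clip attends over word tokens only."""
    d = clip_feats.shape[1]
    attn = softmax(clip_feats @ local_feats.T / np.sqrt(d), axis=-1)  # (T, n_words)
    return attn @ local_feats, attn                          # (T, d), (T, n_words)


# Random stand-ins for real video and text features (illustrative only).
rng = np.random.default_rng(0)
clip_feats = rng.normal(size=(6, 8))   # 6 video clips, feature dimension 8
text_feats = rng.normal(size=(5, 8))   # 4 word tokens followed by one [EOS] token
eos_index = 4

global_feat, local_feats = split_text_tokens(text_feats, eos_index)
global_scores = global_branch(clip_feats, global_feat)
local_context, attn = local_branch(clip_feats, local_feats)
```

Because the [EOS] token is removed before the word-level attention, the softmax in `local_branch` cannot put its weight on the sentence summary, which is the over-reliance the abstract reports. How DualGround groups words into phrases and fuses the two branches is not covered by the text above, so the sketch leaves both out.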

Videos

Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition