D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation
Hanjun Li, Xiujun Shu, Sunan He, Ruizhi Qiao, Wei Wen, Taian Guo, Bei Gan, Xing Sun

TL;DR
This paper introduces D3G, a weakly supervised framework for temporal sentence grounding that uses single-frame (glance) annotations and a Gaussian prior to locate video moments with reduced annotation effort.
Contribution
The paper proposes a Dynamic Gaussian prior based Grounding framework with Glance annotation (D3G), combining semantic alignment and dynamic distribution adjustment for improved weakly supervised TSG.
Findings
Significantly outperforms state-of-the-art weakly supervised methods.
Narrows the performance gap with fully supervised approaches.
Demonstrates effectiveness on three challenging benchmarks.
Abstract
Temporal sentence grounding (TSG) aims to locate a specific moment in an untrimmed video given a natural language query. Weakly supervised methods still have a large performance gap compared to fully supervised ones, while the latter require laborious timestamp annotations. In this study, we aim to reduce the annotation cost while keeping performance on the TSG task competitive with fully supervised methods. To achieve this goal, we investigate a recently proposed glance-supervised temporal sentence grounding task, which requires only a single-frame annotation (referred to as glance annotation) for each query. Under this setup, we propose a Dynamic Gaussian prior based Grounding framework with Glance annotation (D3G), which consists of a Semantic Alignment Group Contrastive Learning module (SA-GCL) and a Dynamic Gaussian prior Adjustment module (DGA). Specifically, SA-GCL…
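The abstract is cut off before it describes SA-GCL and DGA in detail, so the following is only an illustrative sketch of the general idea of a Gaussian prior centered on a glance-annotated frame. It is not the authors' implementation. The function names, the fixed `sigma`, and the choice to score a candidate moment by averaging the prior over its clips are all assumptions made for this example.

```python
import math

def gaussian_prior(num_clips, glance_idx, sigma):
    """Hypothetical helper: weights over clip indices, peaking at the glance clip."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return [
        math.exp(-((i - glance_idx) ** 2) / (2.0 * sigma ** 2))
        for i in range(num_clips)
    ]

def moment_weight(prior, start, end):
    """Hypothetical helper: mean prior weight over clips start..end (inclusive)."""
    span = prior[start:end + 1]
    return sum(span) / len(span)

# Assumed toy setup: a video split into 16 clips, glance annotation at clip 5.
prior = gaussian_prior(num_clips=16, glance_idx=5, sigma=2.0)

# A candidate moment around the glance gets a higher weight than a distant one.
near = moment_weight(prior, 4, 6)
far = moment_weight(prior, 12, 14)
```

Under these assumptions, candidate moments close to the glance frame receive larger weights and would be favored as positives. A dynamic variant, as the DGA module's name suggests, would adjust the distribution during training rather than keep `sigma` and the center fixed.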
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Contrastive Learning
