Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos
Yitian Yuan, Lin Ma, Jingwen Wang, Wei Liu, Wenwu Zhu

TL;DR
This paper introduces a semantic conditioned dynamic modulation mechanism that enhances temporal sentence grounding in videos by better aligning sentence semantics with video content, leading to improved accuracy.
Contribution
The paper proposes a novel SCDM mechanism that dynamically modulates temporal convolutions based on sentence semantics, improving video-sentence correlation for grounding.
Findings
Outperforms state-of-the-art methods on three datasets
Demonstrates improved accuracy in localizing target video segments
Shows effectiveness of dynamic semantic modulation in temporal modeling
Abstract
Temporal sentence grounding in videos aims to detect and localize one target video segment, which semantically corresponds to a given sentence. Existing methods mainly tackle this task by matching and aligning semantics between a sentence and candidate video segments, while neglecting the fact that the sentence information plays an important role in temporally correlating and composing the described contents in videos. In this paper, we propose a novel semantic conditioned dynamic modulation (SCDM) mechanism, which relies on the sentence semantics to modulate the temporal convolution operations for better correlating and composing the sentence-related video contents over time. More importantly, the proposed SCDM performs dynamically with respect to the diverse video contents so as to establish a more precise matching relationship between sentence and video, thereby improving the temporal…
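The idea of sentence-conditioned modulation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes dot-product attention from each video time step over the sentence words, a per-channel normalization over time, and tanh-bounded scale and shift vectors. The function name `scdm_modulate` and the parameters `w_gamma` and `w_beta` are hypothetical, chosen only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (d x d) matrix, stored as a list of rows, by a length-d vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def scdm_modulate(video_feats, word_feats, w_gamma, w_beta, eps=1e-5):
    """Scale and shift each time step of a video feature map using sentence context.

    video_feats: T vectors of size d (one per temporal position)
    word_feats:  N vectors of size d (one per sentence word)
    w_gamma, w_beta: hypothetical (d x d) projection matrices
    """
    num_steps = len(video_feats)
    dim = len(video_feats[0])

    # Per-channel mean and variance over time (assumed normalization scheme).
    means = [sum(f[k] for f in video_feats) / num_steps for k in range(dim)]
    variances = [
        sum((f[k] - means[k]) ** 2 for f in video_feats) / num_steps
        for k in range(dim)
    ]

    modulated = []
    for feat in video_feats:
        # Each time step attends over the words (dot-product scores are an assumption),
        # so the modulation depends on the video content at that position.
        scores = [sum(a * b for a, b in zip(feat, word)) for word in word_feats]
        weights = softmax(scores)
        context = [
            sum(wt * word[k] for wt, word in zip(weights, word_feats))
            for k in range(dim)
        ]

        # Sentence-conditioned scale (gamma) and shift (beta), bounded by tanh.
        gamma = [math.tanh(g) for g in matvec(w_gamma, context)]
        beta = [math.tanh(b) for b in matvec(w_beta, context)]

        modulated.append([
            gamma[k] * (feat[k] - means[k]) / math.sqrt(variances[k] + eps) + beta[k]
            for k in range(dim)
        ])
    return modulated
```

In a full model, a step like this would sit between temporal convolution layers, so that each layer's features are rescaled by the sentence before the next convolution composes them over time. Because the attention weights are recomputed at every time step, two positions with different visual content receive different scale and shift vectors from the same sentence.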
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Convolution
