Let Me Finish My Sentence: Video Temporal Grounding with Holistic Text Understanding
Jongbhin Woo, Hyeonggon Ryu, Youngjoon Jang, Jae Won Cho, Joon Son Chung

TL;DR
This paper proposes a novel approach for Video Temporal Grounding that emphasizes holistic understanding of text queries, improving the accuracy of identifying relevant video segments by incorporating global sentence meaning.
Contribution
It introduces a frame-level gate mechanism and a cross-modal alignment loss to enhance the model's understanding of query semantics and improve grounding performance.
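To make the gating idea concrete, here is a minimal sketch of a frame-level gate driven by a single sentence vector. It is an illustration, not the authors' implementation: the mean pooling in `mean_pool`, the scaled dot-product score, the sigmoid gate, and the function names are all assumptions, since the summary does not specify how the holistic text representation is built or applied.

```python
import math

def mean_pool(token_feats):
    """Average word-token vectors into one sentence vector (assumed pooling)."""
    dim = len(token_feats[0])
    n = len(token_feats)
    return [sum(tok[d] for tok in token_feats) / n for d in range(dim)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def frame_gate(frame_feats, token_feats):
    """Scale each frame feature by a gate in (0, 1).

    The gate is the sigmoid of a scaled dot product between the frame
    and the pooled sentence vector (a hypothetical choice of score).
    """
    sentence = mean_pool(token_feats)
    scale = math.sqrt(len(sentence))
    gates, gated = [], []
    for frame in frame_feats:
        score = sum(f * s for f, s in zip(frame, sentence)) / scale
        g = sigmoid(score)
        gates.append(g)
        gated.append([g * f for f in frame])
    return gates, gated

# Toy data: three 2-d frame features and two 2-d word-token features.
frames = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
tokens = [[2.0, 0.0], [4.0, 0.0]]
gates, gated = frame_gate(frames, tokens)
```

In this toy example the frame that points in the same direction as the sentence vector gets the largest gate and the opposing frame gets the smallest, so frames unrelated to the sentence as a whole are attenuated before any token-level attention is applied.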
Findings
Outperforms state-of-the-art VTG methods on benchmark datasets.
The holistic text understanding improves focus on semantically relevant video frames.
Regularization with alignment loss enhances correlation between text and visual content.
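The alignment regularizer mentioned above can be sketched as a per-frame loss that pushes query-relevant frames toward the sentence vector and irrelevant frames away from it. This is a hedged stand-in rather than the paper's loss: the cosine similarity, the `temperature` parameter, the binary cross-entropy form, and the 0/1 `labels` are assumptions introduced for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def alignment_loss(frame_feats, sentence, labels, temperature=0.1):
    """Mean binary cross-entropy between frame-sentence similarity and labels.

    labels[i] is 1 if frame i lies inside the ground-truth segment, else 0
    (hypothetical supervision format).
    """
    eps = 1e-8
    total = 0.0
    for frame, y in zip(frame_feats, labels):
        p = 1.0 / (1.0 + math.exp(-cosine(frame, sentence) / temperature))
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps))
    return total / len(labels)

frames = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
sentence = [1.0, 0.0]
loss_matched = alignment_loss(frames, sentence, [1, 0, 0])
loss_flipped = alignment_loss(frames, sentence, [0, 1, 1])
```

With these toy vectors, labels that agree with the similarity pattern give a much smaller loss than the flipped labels, which is the behavior a text-visual alignment regularizer is meant to encourage.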
Abstract
Video Temporal Grounding (VTG) aims to identify visual frames in a video clip that match text queries. Recent studies in VTG employ cross-attention to correlate visual frames and text queries as individual token sequences. However, these approaches overlook a crucial aspect of the problem: a holistic understanding of the query sentence. A model may capture correlations between individual word tokens and arbitrary visual frames while possibly missing out on the global meaning. To address this, we introduce two primary contributions: (1) a visual frame-level gate mechanism that incorporates holistic textual information, (2) cross-modal alignment loss to learn the fine-grained correlation between query and relevant frames. As a result, we regularize the effect of individual word tokens and suppress irrelevant visual frames. We demonstrate that our method outperforms state-of-the-art…
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Focus · Contrastive Language-Image Pre-training
