Let Me Finish My Sentence: Video Temporal Grounding with Holistic Text Understanding

Jongbhin Woo; Hyeonggon Ryu; Youngjoon Jang; Jae Won Cho; Joon Son Chung

arXiv:2410.13598·cs.CV·October 18, 2024

Open Access

TL;DR

This paper proposes a novel approach for Video Temporal Grounding that emphasizes holistic understanding of text queries, improving the accuracy of identifying relevant video segments by incorporating global sentence meaning.

Contribution

It introduces a frame-level gate mechanism and a cross-modal alignment loss to enhance the model's understanding of query semantics and improve grounding performance.
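
The summary above does not spell out how the gate is computed. As a minimal sketch, assuming the gate is a sigmoid over the cosine similarity between each frame feature and a single pooled sentence embedding, the idea might look like the following. The function name `frame_gate`, the `temperature` parameter, and the toy data are all hypothetical and are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frame_gate(frames, sentence, temperature=1.0):
    """Hypothetical frame-level gate (an assumption, not the paper's formula).

    frames:   (T, D) array, one feature vector per video frame.
    sentence: (D,) array, one pooled embedding for the whole query.
    Returns the gated frames (T, D) and the per-frame gate values (T,).
    """
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    s = sentence / np.linalg.norm(sentence)
    similarity = f @ s                       # cosine similarity per frame
    gate = sigmoid(similarity / temperature)  # values in (0, 1)
    return frames * gate[:, None], gate

# Toy data: the sentence embedding is a noisy copy of frame 2.
rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 8))
sentence = frames[2] + 0.1 * rng.normal(size=8)
gated, gate = frame_gate(frames, sentence, temperature=0.1)
```

In this toy setup the frame closest to the sentence embedding receives the largest gate value, while the remaining frames are scaled down. This only illustrates suppressing frames with a sentence-level signal rather than per-word attention.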

Findings

01. Outperforms state-of-the-art VTG methods on benchmark datasets.

02. The holistic text understanding improves focus on semantically relevant video frames.

03. Regularization with alignment loss enhances correlation between text and visual content.

Abstract

Video Temporal Grounding (VTG) aims to identify visual frames in a video clip that match text queries. Recent studies in VTG employ cross-attention to correlate visual frames and text queries as individual token sequences. However, these approaches overlook a crucial aspect of the problem: a holistic understanding of the query sentence. A model may capture correlations between individual word tokens and arbitrary visual frames while possibly missing out on the global meaning. To address this, we introduce two primary contributions: (1) a visual frame-level gate mechanism that incorporates holistic textual information, (2) cross-modal alignment loss to learn the fine-grained correlation between query and relevant frames. As a result, we regularize the effect of individual word tokens and suppress irrelevant visual frames. We demonstrate that our method outperforms state-of-the-art…
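
The abstract names a cross-modal alignment loss but does not give its form. One plausible reading, offered here only as an assumption, is a per-frame binary cross-entropy that pushes the similarity between the sentence embedding and each frame up for frames inside the ground-truth span and down elsewhere. The name `alignment_loss`, the `temperature` value, and the toy inputs are hypothetical.

```python
import numpy as np

def alignment_loss(frames, sentence, relevant, temperature=0.1):
    """Hypothetical alignment loss (an assumption, not the paper's formula).

    frames:   (T, D) frame features.
    sentence: (D,) pooled query embedding.
    relevant: (T,) array of 0/1 labels, 1 for frames inside the target span.
    Returns the mean binary cross-entropy over frames.
    """
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    s = sentence / np.linalg.norm(sentence)
    logits = (f @ s) / temperature
    # Numerically stable binary cross-entropy computed from logits.
    per_frame = (np.maximum(logits, 0.0)
                 - logits * relevant
                 + np.log1p(np.exp(-np.abs(logits))))
    return float(per_frame.mean())

# Toy data: the first two frames point along the query, the last two away.
frames = np.array([[1.0, 0.0, 0.0],
                   [1.0, 0.1, 0.0],
                   [-1.0, 0.0, 0.0],
                   [-1.0, 0.2, 0.0]])
sentence = np.array([1.0, 0.0, 0.0])
labels = np.array([1.0, 1.0, 0.0, 0.0])

loss_aligned = alignment_loss(frames, sentence, labels)
loss_flipped = alignment_loss(frames, sentence, 1.0 - labels)
```

With labels that match the geometry the loss is close to zero, and with flipped labels it is large. This is the behavior one would want from a term that rewards query-frame agreement on relevant frames.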

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Focus · Contrastive Language-Image Pre-training