Video sentence grounding with temporally global textual knowledge

Cai Chen; Runzhong Zhang; Jianjun Gao; Kejun Wu; Kim-Hui Yap; Yi Wang

arXiv:2404.13611 · cs.CV · June 4, 2024

TL;DR

This paper introduces a novel approach for temporal sentence grounding in videos by leveraging pseudo-query features with global textual knowledge to improve multi-modal feature alignment and grounding accuracy.

Contribution

We propose the Pseudo-query Intermediary Network (PIN) that uses contrastive learning and learnable prompts to better align visual and textual features for temporal grounding.

Findings

01 Significant improvement on the Charades-STA dataset

02 Effective bridging of the domain gap between modalities

03 Enhanced feature alignment with pseudo-query knowledge

Abstract

Temporal sentence grounding involves the retrieval of a video moment with a natural language query. Many existing works directly incorporate the given video and temporally localized query for temporal grounding, overlooking the inherent domain gap between different modalities. In this paper, we utilize pseudo-query features containing extensive temporally global textual knowledge sourced from the same video-query pair, to enhance the bridging of domain gaps and attain a heightened level of similarity between multi-modal features. Specifically, we propose a Pseudo-query Intermediary Network (PIN) to achieve an improved alignment of visual and comprehensive pseudo-query features within the feature space through contrastive learning. Subsequently, we utilize learnable prompts to encapsulate the knowledge of pseudo-queries, propagating them into the textual encoder and multi-modal fusion…
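
The contrastive alignment of visual and pseudo-query features mentioned in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a generic symmetric InfoNCE objective over cosine similarities, which is one common way to do contrastive alignment, and the function names, the toy feature vectors, and the temperature value of 0.07 are all hypothetical choices made for illustration. The abstract does not specify the actual loss used by PIN.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity_matrix(visual, pseudo):
    """Entry [i][j] is the cosine similarity of visual[i] and pseudo[j]."""
    v = [l2_normalize(row) for row in visual]
    p = [l2_normalize(row) for row in pseudo]
    return [[sum(a * b for a, b in zip(vi, pj)) for pj in p] for vi in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(visual, pseudo, temperature=0.07):
    """Symmetric InfoNCE loss (assumed objective, not confirmed by the paper).

    visual[i] and pseudo[i] are treated as a matched pair; every other
    pairing in the batch is treated as a negative.
    """
    sim = cosine_similarity_matrix(visual, pseudo)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # pseudo-to-visual direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy features (hypothetical): three videos and their pseudo-query embeddings.
visual = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # deliberately mismatched pairs

loss_aligned = symmetric_info_nce(visual, aligned)
loss_shuffled = symmetric_info_nce(visual, shuffled)
print(f"aligned pairs:  {loss_aligned:.4f}")
print(f"shuffled pairs: {loss_shuffled:.4f}")
```

In this sketch the loss is lower when each visual feature sits closest to its own pseudo-query feature, which is the behavior a contrastive alignment objective is meant to reward. A real model would compute the features with learned encoders and backpropagate through the loss; that part is omitted here.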

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques · Multimodal Machine Learning Applications