Text-Video Retrieval With Global-Local Contrastive Consistency Learning
Xiaolun Jing, Xinxing Yang, Genke Yang

TL;DR
This paper introduces GLCCL, a novel method for text-video retrieval that enhances semantic alignment using a parameter-free interaction module and a contrastive score consistency loss, improving efficiency and accuracy.
Contribution
The paper proposes a simple, effective approach combining a parameter-free interaction module and a novel contrastive loss for better text-video semantic alignment.
Findings
GLCCL outperforms existing methods on the MSR-VTT, DiDeMo, and VATEX datasets.
CSC loss improves discriminative power between positive and hard negative pairs.
GLCCL achieves better retrieval accuracy with lower computational overhead.
Abstract
Text-video retrieval aims to find the videos most semantically similar to a given text query. However, since videos contain more diverse content than texts, the main semantics expressed by each text-video pair are often only partially relevant. The primary methods use a language-video attention module to align texts and videos. Though effective, this paradigm inevitably introduces prohibitive computational overhead, resulting in inefficient retrieval. In this paper, we propose a simple yet effective method called Global-Local Contrastive Consistency Learning (GLCCL) to align the semantics of texts and videos. Specifically, we design a parameter-free Global-Local Interaction Module (GLIM) to generate semantic-related frame and video features in a text-guided manner. Furthermore, a Contrastive Score Consistency (CSC) loss is developed to promote consistency learning…
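The abstract does not spell out how GLIM works internally, so the following is only a minimal sketch of what a parameter-free, text-guided interaction could look like, not the authors' implementation. It assumes that the text and the frames are already embedded in a shared space (for example by a CLIP-style encoder), and that frame weights come from a softmax over text-frame cosine similarities. The names `text_guided_pool` and `retrieval_scores` and the `temperature` value are hypothetical. The temperature is a fixed hyperparameter, so nothing here is learned.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def text_guided_pool(text_emb, frame_embs, temperature=0.1):
    """Pool frame features into one video feature, weighted by text relevance.

    ASSUMPTION: this softmax-over-cosine-similarity weighting is an
    illustrative stand-in for a parameter-free interaction, not GLIM itself.

    text_emb:   (D,)   one text embedding
    frame_embs: (T, D) per-frame embeddings of one video
    Returns (weights of shape (T,), video feature of shape (D,)).
    """
    t = l2_normalize(text_emb)
    f = l2_normalize(frame_embs)
    sims = f @ t                      # cosine similarity of each frame to the text
    logits = sims / temperature
    logits = logits - logits.max()    # shift for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    video = weights @ f               # text-conditioned weighted average of frames
    return weights, l2_normalize(video)


def retrieval_scores(text_embs, video_frames, temperature=0.1):
    """Score every text against every video.

    text_embs:    (N, D)
    video_frames: (M, T, D)
    Returns an (N, M) matrix of cosine scores.
    """
    scores = np.empty((len(text_embs), len(video_frames)))
    for i, t in enumerate(text_embs):
        for j, frames in enumerate(video_frames):
            _, v = text_guided_pool(t, frames, temperature)
            scores[i, j] = l2_normalize(t) @ v
    return scores
```

Because the pooled video feature depends on the query, it has to be recomputed for every text-video pair. That costs a few matrix products per pair and requires no attention layers or extra weights.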
