Text-Video Retrieval With Global-Local Contrastive Consistency Learning

Xiaolun Jing; Xinxing Yang; Genke Yang

arXiv:2605.17959·cs.IR·May 19, 2026

TL;DR

This paper introduces GLCCL, a text-video retrieval method that aligns text and video semantics using a parameter-free Global-Local Interaction Module (GLIM) and a Contrastive Score Consistency (CSC) loss, aiming for accurate retrieval without the computational overhead of language-video attention.

Contribution

The paper proposes a simple, effective approach combining a parameter-free interaction module and a novel contrastive loss for better text-video semantic alignment.

Findings

01 Outperforms existing methods on the MSR-VTT, DiDeMo, and VATEX datasets.

02 CSC loss improves discriminative power between positive and hard negative pairs.

03 GLCCL achieves better retrieval accuracy with lower computational overhead.

Abstract

Text-video retrieval aims to find the videos most semantically similar to a given text query. However, since videos contain more diverse content than texts, the text and video in each pair are often only partially relevant to each other. Prevailing methods use a language-video attention module to align texts and videos. Though effective, this paradigm inevitably introduces prohibitive computational overhead, resulting in inefficient retrieval. In this paper, we propose a simple yet effective method called Global-Local Contrastive Consistency Learning (GLCCL) to align the semantics of texts and videos. Specifically, we design a parameter-free Global-Local Interaction Module (GLIM) to generate semantically related frame and video features in a text-guided manner. Furthermore, a Contrastive Score Consistency (CSC) loss is developed to promote consistency learning…
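
To make the idea of a parameter-free, text-guided interaction concrete, the sketch below pools frame embeddings using weights derived only from text-frame cosine similarity. It is not the authors' GLIM, whose exact formulation is not given in the abstract. The function name `text_guided_video_feature`, the softmax weighting, and the `temperature` value are all assumptions chosen for illustration.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guard against the zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores, temperature):
    # Numerically stable softmax with a temperature.
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def text_guided_video_feature(text_emb, frame_embs, temperature=0.1):
    """Hypothetical parameter-free interaction (an assumption, not the paper's GLIM).

    Returns (frame_weights, local_scores, global_score):
      local_scores  - cosine similarity between the text and each frame
      frame_weights - softmax of the local scores (no learnable parameters)
      global_score  - cosine similarity between the text and the
                      weight-pooled video feature
    """
    t = l2_normalize(text_emb)
    frames = [l2_normalize(f) for f in frame_embs]
    local_scores = [dot(t, f) for f in frames]
    weights = softmax(local_scores, temperature)
    pooled = [sum(w * f[d] for w, f in zip(weights, frames))
              for d in range(len(t))]
    global_score = dot(t, l2_normalize(pooled))
    return weights, local_scores, global_score

# Toy example: the first frame matches the text, the second is unrelated.
weights, local_scores, global_score = text_guided_video_feature(
    [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
)
```

In this toy case the frame aligned with the text receives most of the weight, so the pooled video feature scores higher against the text than a plain average of frames would. Because the weights come from similarities alone, the step adds no trainable parameters, which is the property the abstract attributes to GLIM.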
