Text-Video Retrieval with Global-Local Semantic Consistent Learning

Haonan Zhang; Pengpeng Zeng; Lianli Gao; Jingkuan Song; Yihang Duan; Xinyu Lyu; Hengtao Shen

arXiv:2405.12710 · cs.CV · July 17, 2024 · 2 cites

TL;DR

This paper introduces GLSCL, a text-video retrieval method that exploits semantic concepts shared across modalities at low computational cost, achieving performance comparable to state-of-the-art methods with significantly faster retrieval.

Contribution

The paper proposes a parameter-free global interaction module and a learnable local interaction module for efficient semantic alignment in text-video retrieval.

Findings

01 Achieves performance comparable to SOTA methods.

02 Nearly 220 times faster in terms of computational cost.

03 Validated on five widely used benchmarks.

Abstract

Adapting large-scale image-text pre-training models, e.g., CLIP, to the video domain represents the current state-of-the-art for text-video retrieval. The primary approaches involve transferring text-video pairs to a common embedding space and leveraging cross-modal interactions on specific entities for semantic alignment. Though effective, these paradigms entail prohibitive computational costs, leading to inefficient retrieval. To address this, we propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which capitalizes on latent shared semantics across modalities for text-video retrieval. Specifically, we introduce a parameter-free global interaction module to explore coarse-grained alignment. Then, we devise a shared local interaction module that employs several learnable queries to capture latent semantic concepts for learning fine-grained…
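The two modules named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give formulas, so the text-guided softmax pooling, the temperature `tau`, the function names, and the toy vectors below are all assumptions chosen for illustration. The global function uses no trainable weights, and the local function applies the same set of query vectors to both modalities (here fixed by hand as stand-ins for learned queries).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, feats, tau=1.0):
    """Softmax-weighted sum of feats, weighted by similarity to query."""
    weights = softmax([dot(query, f) / tau for f in feats])
    dim = len(feats[0])
    return [sum(w * f[i] for w, f in zip(weights, feats)) for i in range(dim)]

def global_score(text_vec, frame_vecs, tau=0.1):
    """Assumed parameter-free global interaction: pool frames by their
    similarity to the sentence embedding, then take cosine similarity."""
    t = l2_normalize(text_vec)
    frames = [l2_normalize(f) for f in frame_vecs]
    video = attend(t, frames, tau)
    return dot(t, l2_normalize(video))

def local_concepts(queries, feats):
    """Assumed local interaction: each query pools the features it matches
    best, giving one concept vector per query."""
    return [attend(q, feats) for q in queries]

# Toy data (hypothetical): 2-D features, two frames, two words.
text = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0]]
words = [[0.9, 0.1], [0.2, 0.8]]
queries = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for learned queries

score = global_score(text, frames)
video_concepts = local_concepts(queries, frames)  # same queries ...
text_concepts = local_concepts(queries, words)    # ... for both modalities
```

In this sketch the global score costs one pass over the frame features per text-video pair and involves no learned parameters, while the shared queries reduce each modality to a fixed number of concept vectors regardless of how many frames or words it contains.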

Code & Models

Repositories

zchoi/glscl (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling

Methods: Contrastive Language-Image Pre-training