Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

Kaibin Tian, Yanhua Cheng, Yi Liu, Xinglin Hou, Quan Chen, and Han Li

arXiv:2401.00701 · cs.CV · January 2, 2024 · 1 citation

TL;DR

This paper introduces a two-stage, coarse-to-fine text-to-video retrieval framework built on multi-granularity visual features. It reports retrieval nearly 50 times faster while keeping accuracy comparable to state-of-the-art methods.

Contribution

The paper proposes a novel coarse-to-fine retrieval architecture with a parameter-free text-gated interaction block and Pearson Constraint, enhancing feature utilization and retrieval speed.
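The listing does not spell out how the text-gated interaction block works, so the following is only a minimal sketch of one parameter-free way to let a text embedding gate frame features: weight each frame by the softmax of its similarity to the text, then average. The function names (`text_gated_pool`, `softmax`, `dot`), the `temperature` parameter, and the toy vectors are all assumptions made for illustration, not the authors' implementation.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores, temperature=1.0):
    """Numerically stable softmax; lower temperature gives sharper weights."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_gated_pool(text_vec, frame_vecs, temperature=1.0):
    """Pool frame features with weights derived only from text-frame similarity.

    Hypothetical sketch: there are no learned parameters, the text embedding
    alone decides how much each frame contributes to the pooled video vector.
    """
    weights = softmax([dot(text_vec, f) for f in frame_vecs], temperature)
    dim = len(frame_vecs[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, frame_vecs)) for i in range(dim)]
    return pooled, weights


# Toy example: the first frame matches the text, the second does not.
text = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0]]
pooled, weights = text_gated_pool(text, frames, temperature=0.1)
```

In this toy setup the frame aligned with the text receives almost all of the weight, so the pooled vector stays close to that frame. Because the weights come from a softmax over similarities, the block adds no trainable parameters.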

Findings

01. Achieves comparable performance to state-of-the-art methods

02. Nearly 50 times faster retrieval speed

03. Effective multi-granularity visual feature learning

Abstract

In recent years, text-to-video retrieval methods based on CLIP have experienced rapid development. The primary direction of evolution is to exploit the much wider gamut of visual and textual cues to achieve alignment. Concretely, those methods with impressive performance often design a heavy fusion block for sentence (words)-video (frames) interaction, regardless of the prohibitive computation complexity. Nevertheless, these approaches are not optimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, ensuring the model's comprehensiveness in capturing visual content features spanning from abstract to detailed levels during the training phase. To better leverage the multi-granularity features, we devise a two-stage retrieval architecture in the retrieval phase. This solution ingeniously balances the coarse…
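The two-stage idea in the abstract can be sketched as a cheap recall pass followed by a more expensive re-rank over a short list. The sketch below is an assumption-laden illustration, not the paper's method: it assumes each video has one coarse (video-level) vector and a list of frame-level vectors, and it uses a hypothetical text-weighted frame pooling as the fine scorer. The names `coarse_to_fine_search`, `fine_score`, `top_k`, and `temperature` are invented for this example.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fine_score(text_vec, frame_vecs, temperature=0.1):
    """Hypothetical fine-grained score: pool frames by text similarity, then compare."""
    sims = [dot(text_vec, f) for f in frame_vecs]
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * f[i] for w, f in zip(weights, frame_vecs))
        for i in range(len(text_vec))
    ]
    return dot(text_vec, pooled)


def coarse_to_fine_search(text_vec, videos, top_k=2):
    """Stage 1: rank all videos by the cheap coarse score and keep top_k.
    Stage 2: re-rank only those candidates with the costlier fine score."""
    recalled = sorted(videos, key=lambda v: dot(text_vec, v["coarse"]), reverse=True)
    candidates = recalled[:top_k]
    rescored = [(v["id"], fine_score(text_vec, v["frames"])) for v in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Toy gallery with made-up vectors.
videos = [
    {"id": "a", "coarse": [0.9, 0.1], "frames": [[1.0, 0.0], [0.0, 1.0]]},
    {"id": "b", "coarse": [0.8, 0.2], "frames": [[1.0, 0.0], [1.0, 0.0]]},
    {"id": "c", "coarse": [0.0, 1.0], "frames": [[0.0, 1.0], [0.0, 1.0]]},
]
ranked = coarse_to_fine_search([1.0, 0.0], videos, top_k=2)
```

Here the coarse pass keeps "a" and "b" and drops "c"; the fine pass then places "b" ahead of "a" because every frame of "b" matches the query. The efficiency argument in this kind of design is that the fine scorer runs on only `top_k` candidates rather than on the whole gallery.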

Code & Models

Repositories

adxcreative/EERCF (PyTorch, official)

Videos

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Contrastive Language-Image Pre-training