Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?

Wenhao Wu; Haipeng Luo; Bo Fang; Jingdong Wang; Wanli Ouyang

arXiv:2301.00184 · cs.CV · March 29, 2023 · 6 cites



TL;DR

Cap4Video introduces a novel framework that leverages automatically generated video captions to enhance text-video retrieval, achieving state-of-the-art results across multiple benchmarks without post-processing.

Contribution

The paper proposes a new approach that uses zero-shot captioning to generate auxiliary captions, improving retrieval performance through data augmentation, feature interaction, and combined scoring.
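To make the "combined scoring" idea concrete, here is a minimal sketch that blends a query-video similarity with a query-caption similarity before ranking. It is an illustration under stated assumptions, not the paper's implementation: the function names, the mixing weight `alpha`, and the toy 2-D embeddings are all hypothetical, and real embeddings would come from a pre-trained encoder such as CLIP.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fused_scores(query_emb, video_embs, caption_embs, alpha=0.5):
    """Blend query-video and query-caption similarity for each candidate.

    alpha is a hypothetical mixing weight, not a value taken from the paper.
    """
    return [
        alpha * cosine(query_emb, v) + (1.0 - alpha) * cosine(query_emb, c)
        for v, c in zip(video_embs, caption_embs)
    ]


def rank(scores):
    """Candidate indices sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 2-D embeddings, invented for illustration only.
query = [1.0, 0.0]
videos = [[0.6, 0.8], [0.8, 0.6]]    # video 1 is visually closer to the query
captions = [[1.0, 0.0], [0.0, 1.0]]  # but video 0's caption matches it exactly

scores = fused_scores(query, videos, captions)
order = rank(scores)
video_only_order = rank(fused_scores(query, videos, captions, alpha=1.0))
```

In this toy case the video-only ranking puts candidate 1 first, while the fused ranking promotes candidate 0 because its caption agrees with the query. That is the kind of complementary signal the auxiliary captions are meant to provide.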

Findings

01. Achieves state-of-the-art results on four standard text-video retrieval benchmarks: MSR-VTT, VATEX, MSVD, and DiDeMo.

02. Demonstrates the effectiveness of caption-based augmentation.

03. Shows improvements without post-processing.

Abstract

Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences. However, in real-world scenarios, online videos are often accompanied by relevant text information such as titles, tags, and even subtitles, which can be utilized to match textual queries. This insight has motivated us to propose a novel approach to text-video retrieval, where we directly generate associated captions from videos using zero-shot video captioning with knowledge from web-scale pre-trained models (e.g., CLIP and GPT-2). Given the generated captions, a natural question arises: what benefits do they bring to text-video retrieval? To answer this, we introduce Cap4Video, a new framework that leverages captions in three ways: i) Input data: video-caption pairs can augment the training data. ii) Intermediate feature interaction: we perform…
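Point i) of the abstract (video-caption pairs as extra training data) can be sketched as follows. This is a simplified illustration rather than the authors' pipeline: `augment_pairs`, the video ID, and both sentences are hypothetical, and the zero-shot captioner that would produce the generated captions is not shown.

```python
def augment_pairs(annotated_pairs, generated_captions):
    """Return the annotated (text, video_id) pairs plus one pair per generated caption.

    generated_captions maps video_id -> caption string. In practice the captions
    would come from a zero-shot captioner; here they are supplied by hand.
    """
    augmented = list(annotated_pairs)
    for video_id, caption in generated_captions.items():
        # Treat the generated caption as an additional positive text for its video.
        augmented.append((caption, video_id))
    return augmented


# Hypothetical example data.
annotated = [("a man plays guitar on stage", "vid_001")]
generated = {"vid_001": "a musician performing with a guitar"}

training_pairs = augment_pairs(annotated, generated)
```

Each video then contributes both its human-written query and its generated caption as matching text during training.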


Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training