Learning text-to-video retrieval from image captioning

Lucas Ventura; Cordelia Schmid; Gül Varol

arXiv:2404.17498·cs.CV·April 29, 2024

Open Access

TL;DR

This paper introduces a training protocol for text-to-video retrieval that uses unlabeled videos and labeled images: image captioning models automatically label video frames, and training on these labels outperforms the zero-shot CLIP baseline on three datasets.

Contribution

The paper presents a new method for training text-to-video retrieval models using unlabeled videos and labeled images, avoiding manual annotation and outperforming zero-shot CLIP baselines.

Findings

01 Outperforms the zero-shot CLIP baseline on three datasets.

02 Uses image captioning to automatically label video frames.

03 Adapts to the target video domain without manual annotation.

Abstract

We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario given that annotating images is cheaper therefore scalable, in contrast to expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide supervision signal into unlabeled videos. We show that automatically labeling video frames with image captioning allows text-to-video retrieval training. This…
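The training idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the caption-selection rule (keep the generated caption most similar to its own frame), mean pooling over frames, the temperature value, and the toy 2-D embeddings are all assumptions made for illustration. In practice the embeddings would come from an image-text model such as CLIP and the captions from an image captioning model.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def mean_pool(frame_embs):
    """Average per-frame embeddings into one unit-length video embedding."""
    n, dim = len(frame_embs), len(frame_embs[0])
    return normalize([sum(f[d] for f in frame_embs) / n for d in range(dim)])


def pick_pseudo_caption(frame_embs, caption_embs):
    """Index of the generated caption that best matches its own frame.

    caption_embs[i] is the text embedding of the caption produced for
    frame i (hypothetical selection rule, labeled as an assumption).
    """
    scores = [dot(normalize(f), normalize(c))
              for f, c in zip(frame_embs, caption_embs)]
    return max(range(len(scores)), key=scores.__getitem__)


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.05):
    """Cross-entropy over similarities, averaged over both retrieval directions."""
    n = len(video_embs)
    sims = [[dot(v, t) / temperature for t in text_embs] for v in video_embs]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]  # the matching pair sits on the diagonal
        return total / n

    text_to_video = [list(col) for col in zip(*sims)]
    return 0.5 * (cross_entropy(sims) + cross_entropy(text_to_video))


# Toy data: two unlabeled videos, two frames each, one caption per frame.
video_a_frames = [[1.0, 0.0], [0.9, 0.1]]
video_a_captions = [[1.0, 0.1], [0.0, 1.0]]
video_b_frames = [[0.0, 1.0], [0.1, 0.9]]
video_b_captions = [[0.1, 1.0], [1.0, 0.0]]

best_a = pick_pseudo_caption(video_a_frames, video_a_captions)
best_b = pick_pseudo_caption(video_b_frames, video_b_captions)

videos = [mean_pool(video_a_frames), mean_pool(video_b_frames)]
texts = [normalize(video_a_captions[best_a]), normalize(video_b_captions[best_b])]
loss = symmetric_contrastive_loss(videos, texts)
```

The selected captions act as pseudo-labels, so the contrastive loss can be computed without any ground-truth video captions; minimizing it with respect to the video and text encoders is what would adapt the backbone to the target videos.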

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Video Analysis and Summarization

Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training