Learning text-to-video retrieval from image captioning
Lucas Ventura, Cordelia Schmid, Gül Varol

TL;DR
This paper introduces a training protocol for text-to-video retrieval that uses unlabeled videos and labeled images. Image captioning models generate the supervision signal for the videos, and the resulting model outperforms the zero-shot CLIP baseline.
Contribution
The paper presents a method for training text-to-video retrieval models from unlabeled videos and labeled images. It requires no manually annotated video captions and outperforms the zero-shot CLIP baseline.
Findings
Outperforms the zero-shot CLIP baseline on three datasets.
Uses image captioning to automatically label video frames.
Achieves domain adaptation without manual video annotation.
Abstract
We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario given that annotating images is cheaper therefore scalable, in contrast to expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide supervision signal into unlabeled videos. We show that automatically labeling video frames with image captioning allows text-to-video retrieval training. This…
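The idea in the abstract, using captions generated for video frames as pseudo-labels for retrieval training, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the function names, the rule of keeping the caption that best matches its own frame, mean pooling over frames, the symmetric contrastive loss, the temperature of 0.07, and the toy 2-D embeddings that stand in for the outputs of real image and text encoders.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(frame_embs):
    """Average per-frame embeddings into one unit-length video embedding."""
    n, dim = len(frame_embs), len(frame_embs[0])
    return normalize([sum(f[d] for f in frame_embs) / n for d in range(dim)])

def pick_pseudo_caption(frame_embs, caption_embs):
    """Return the index of the frame whose generated caption matches it best.

    caption_embs[i] is the text embedding of the caption generated for frame i.
    (Hypothetical selection rule, used here only for illustration.)
    """
    scores = [dot(normalize(f), normalize(c))
              for f, c in zip(frame_embs, caption_embs)]
    return max(range(len(scores)), key=scores.__getitem__)

def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: pair i is the positive, all other pairs are negatives."""
    n = len(video_embs)
    sims = [[dot(v, t) / temperature for t in text_embs] for v in video_embs]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / n

    cols = [[sims[i][j] for i in range(n)] for j in range(n)]
    return 0.5 * (cross_entropy(sims) + cross_entropy(cols))

# Toy data: two unlabeled videos, two frames each, one generated caption per frame.
videos = [
    {"frames": [[1.0, 0.1], [0.9, 0.2]], "captions": [[1.0, 0.0], [0.0, 1.0]]},
    {"frames": [[0.1, 1.0], [0.2, 0.8]], "captions": [[1.0, 0.0], [0.0, 1.0]]},
]

picks = [pick_pseudo_caption(v["frames"], v["captions"]) for v in videos]
video_embs = [mean_pool(v["frames"]) for v in videos]
text_embs = [normalize(v["captions"][k]) for v, k in zip(videos, picks)]

loss = symmetric_contrastive_loss(video_embs, text_embs)
shuffled_loss = symmetric_contrastive_loss(video_embs, text_embs[::-1])
print(picks, loss, shuffled_loss)
```

In this toy setup the loss is low when each video is paired with its selected pseudo-caption and high when the captions are swapped, which is the signal a retrieval model would be trained on. In practice the embeddings would come from a pretrained image-text model, and the loss would be minimized with a gradient-based optimizer.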
Taxonomy
Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Video Analysis and Summarization
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
