CoVR-2: Automatic Data Construction for Composed Video Retrieval
Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol

TL;DR
This paper introduces a scalable method for automatically creating large datasets for composed video retrieval by mining video-caption pairs and generating modification texts with a language model, enabling improved retrieval performance.
Contribution
The authors propose an automatic dataset construction approach for CoVR, expanding the scope from images to videos, and demonstrate its effectiveness with new benchmarks and state-of-the-art results.
Findings
Constructed the WebVid-CoVR dataset with 1.6 million triplets.
Achieved improved zero-shot retrieval performance on multiple benchmarks.
Validated the methodology's applicability to image-caption pairs, creating 3.3 million CoIR triplets.
Abstract
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically…
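The mining step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The abstract says only that videos with "a similar caption" are paired, so the similarity rule used here (captions that differ in exactly one word) is an assumption for illustration. The function names are hypothetical, and `placeholder_modification` is a simple template standing in for the large language model.

```python
from collections import defaultdict
from itertools import combinations


def mine_caption_pairs(captions):
    """Return index pairs (i, j), i < j, whose captions differ in exactly one word.

    Assumption: "similar caption" is approximated by a one-word difference.
    """
    buckets = defaultdict(list)
    for idx, caption in enumerate(captions):
        words = caption.lower().split()
        for pos in range(len(words)):
            # Key = caption length, masked position, and the remaining words.
            key = (len(words), pos, tuple(words[:pos] + words[pos + 1:]))
            buckets[key].append((idx, words[pos]))

    pairs = set()
    for members in buckets.values():
        for (i, word_i), (j, word_j) in combinations(members, 2):
            if word_i != word_j:  # skip exact duplicate captions
                pairs.add((i, j))
    return sorted(pairs)


def placeholder_modification(source_caption, target_caption):
    """Hypothetical stand-in for the language model: name the swapped word."""
    src = source_caption.lower().split()
    tgt = target_caption.lower().split()
    old, new = next((a, b) for a, b in zip(src, tgt) if a != b)
    return f"replace {old} with {new}"


def build_triplets(captions):
    """Build (query index, modification text, target index) triplets in both directions."""
    triplets = []
    for i, j in mine_caption_pairs(captions):
        triplets.append((i, placeholder_modification(captions[i], captions[j]), j))
        triplets.append((j, placeholder_modification(captions[j], captions[i]), i))
    return triplets


captions = [
    "a dog running on the beach",
    "a cat running on the beach",
    "a man cooking pasta",
]
triplets = build_triplets(captions)
```

Each caption index stands for the video it describes, so a triplet reads as query video, modification text, target video. Bucketing by masked position avoids comparing every caption against every other, which matters at the scale of a collection like WebVid2M.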
Taxonomy
Methods: Composed Video Retrieval
