CoVR-2: Automatic Data Construction for Composed Video Retrieval

Lucas Ventura; Antoine Yang; Cordelia Schmid; Gül Varol

arXiv:2308.14746·cs.CV·April 2, 2025

TL;DR

This paper introduces a scalable method for automatically creating large datasets for composed video retrieval by mining video-caption pairs and generating modification texts with a language model, enabling improved retrieval performance.

Contribution

The authors propose an automatic dataset construction approach for composed video retrieval (CoVR), extending the task from images to videos, and demonstrate its effectiveness with new benchmarks and state-of-the-art results.

Findings

01. Constructed the WebVid-CoVR dataset with 1.6 million triplets.

02. Achieved improved zero-shot retrieval performance on multiple benchmarks.

03. Validated the methodology's applicability to image-caption pairs, creating 3.3 million CoIR triplets.

Abstract

Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically…
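The mining step described in the abstract can be illustrated with a minimal sketch. It assumes that "similar caption" means two captions that differ in exactly one word, which is a simplification chosen for illustration, and it replaces the large language model with a fixed text template. The function names (`mine_one_word_pairs`, `template_modification`, `build_triplets`) and the toy captions are hypothetical and are not taken from the authors' repository.

```python
from collections import defaultdict
from itertools import combinations


def mine_one_word_pairs(captions):
    """Find caption pairs that differ in exactly one word (assumed similarity rule).

    captions: dict mapping a video id to its caption string.
    Returns a list of (id_a, id_b, word_a, word_b) tuples.
    """
    buckets = defaultdict(list)
    for vid, caption in captions.items():
        words = caption.lower().split()
        for i, word in enumerate(words):
            # Key is the caption with position i masked out, so two captions
            # share a key only if they agree everywhere except position i.
            key = (i, tuple(words[:i] + ["_"] + words[i + 1:]))
            buckets[key].append((vid, word))

    pairs = []
    for members in buckets.values():
        for (id_a, word_a), (id_b, word_b) in combinations(members, 2):
            if word_a != word_b:  # skip exact duplicate captions
                pairs.append((id_a, id_b, word_a, word_b))
    return pairs


def template_modification(word_a, word_b):
    """Stand-in for the language model: a fixed template, not the paper's generator."""
    return f"change {word_a} to {word_b}"


def build_triplets(captions):
    """Turn mined pairs into (query video, modification text, target video) triplets."""
    triplets = []
    for id_a, id_b, word_a, word_b in mine_one_word_pairs(captions):
        # Each pair yields one triplet per direction.
        triplets.append({"query": id_a, "text": template_modification(word_a, word_b), "target": id_b})
        triplets.append({"query": id_b, "text": template_modification(word_b, word_a), "target": id_a})
    return triplets


if __name__ == "__main__":
    toy_captions = {
        "v1": "a dog running on the beach",
        "v2": "a cat running on the beach",
        "v3": "a man cooking pasta",
    }
    for triplet in build_triplets(toy_captions):
        print(triplet)
```

Masking one position at a time groups candidate captions by hashing, which avoids comparing every caption against every other one; a real pipeline at the scale of WebVid2M would likely need additional filtering that this sketch omits.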

Code & Models

Repositories

lucas-ventura/CoVR
PyTorch · Official

Datasets

lucas-ventura/WebVid-CoVR
dataset · 66 dl

Taxonomy

Methods: Composed Video Retrieval