Beyond Simple Edits: Composed Video Retrieval with Dense Modifications

Omkar Thawakar; Dmitry Demidov; Ritesh Thawkar; Rao Muhammad Anwer; Mubarak Shah; Fahad Shahbaz Khan; Salman Khan

arXiv:2508.14039·cs.CV·August 20, 2025

TL;DR

This paper introduces Dense-WebVid-CoVR, a large dataset for composed video retrieval with dense modifications, and proposes a new model that achieves state-of-the-art results by integrating visual and textual information through Cross-Attention fusion.

Contribution

The paper presents a novel large-scale dataset for fine-grained composed video retrieval and a new model that effectively aligns dense textual modifications with target videos.

Findings

01. Achieved 71.3% Recall@1 in the visual+text retrieval setting.

02. The dataset contains 1.6 million samples with dense modification texts.

03. The model outperforms existing methods on all evaluation metrics.
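
For readers unfamiliar with the metric, Recall@K is the fraction of queries whose ground-truth target appears among the K highest-scoring candidates. The sketch below is a minimal, generic illustration and not the paper's evaluation code; the function name `recall_at_k` and the toy similarity scores are assumptions made up for this example.

```python
def recall_at_k(similarity, target_index, k):
    """Fraction of queries whose target is among the k highest-scoring candidates.

    similarity[i][j] is the score between query i and candidate video j.
    target_index[i] is the index of the ground-truth target for query i.
    Hypothetical helper for illustration; not taken from the paper.
    """
    hits = 0
    for scores, target in zip(similarity, target_index):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores: 3 queries x 4 candidate videos (made-up numbers).
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # target 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # target 1 is ranked second
    [0.5, 0.6, 0.1, 0.7],  # target 2 is ranked last
]
targets = [0, 1, 2]

r1 = recall_at_k(similarity, targets, 1)
r2 = recall_at_k(similarity, targets, 2)
```

On this toy input only the first query hits at K=1 and the first two hit at K=2. Reported figures such as 71.3% are this fraction multiplied by 100.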

Abstract

Composed video retrieval is a challenging task that strives to retrieve a target video based on a query video and a textual description detailing specific modifications. Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and variations in temporal understanding, limiting their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content. The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text that is around seven times more than its existing counterpart. We further develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion using grounded text…
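
The abstract is cut off before it describes the model, so the following is only a generic sketch of what cross-attention fusion means: each token from one modality forms a weighted average over tokens from the other. It assumes a single head, no learned projections, and visual tokens as queries over text tokens; those choices, the function names, and the toy vectors are all illustrative assumptions, not the paper's architecture.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned weights).

    Each query vector attends over all key vectors and returns the
    attention-weighted average of the corresponding value vectors.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of value vectors, one output channel at a time.
        fused.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        )
    return fused


# Assumption: visual tokens act as queries, text tokens as keys and values.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

fused = cross_attention(visual_tokens, text_tokens, text_tokens)
```

The output has one fused vector per visual token, each a convex combination of the text token vectors. A practical model would add learned query, key, and value projections and multiple heads.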
