CoVR-R: Reason-Aware Composed Video Retrieval

Omkar Thawakar; Dmitry Demidov; Vaishnav Potlapalli; Sai Prasanna Teja Reddy Bogireddy; Viswanatha Reddy Gajjala; Alaa Mostafa Lasheen; Rao Muhammad Anwer; Fahad Khan

arXiv:2603.20190 · cs.CV · March 23, 2026

Open Access

TL;DR

This paper introduces CoVR-R, a reasoning-based, zero-shot approach for composed video retrieval that effectively captures implicit after-effects of edits, outperforming baselines and enhancing interpretability.

Contribution

It proposes a novel reasoning-first, zero-shot method leveraging large multimodal models for compositional video retrieval, along with a new benchmark for evaluating reasoning capabilities.

Findings

01. Outperforms strong retrieval baselines on recall at K.

02. Excels on implicit-effect subsets requiring reasoning.

03. Higher step consistency and effect factuality in retrieved videos.
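
For readers unfamiliar with recall at K, the metric named in the first finding, the following is a minimal sketch of how it is commonly computed in retrieval settings. It is not taken from the paper: the function name `recall_at_k`, the score-matrix layout (one row of candidate scores per query), and the toy numbers are all assumptions made for illustration.

```python
from typing import Sequence


def recall_at_k(
    scores: Sequence[Sequence[float]],
    targets: Sequence[int],
    k: int,
) -> float:
    """Fraction of queries whose target is among the k highest-scoring candidates.

    scores[q][c] is a hypothetical similarity between query q and candidate c;
    targets[q] is the index of the correct candidate for query q.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(scores) != len(targets):
        raise ValueError("scores and targets must have the same length")
    if not targets:
        raise ValueError("at least one query is required")

    hits = 0
    for row, target in zip(scores, targets):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


# Toy data (made up): three queries, three candidate videos each.
toy_scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.5, 0.4],
    [0.1, 0.2, 0.8],
]
toy_targets = [0, 2, 2]

print(recall_at_k(toy_scores, toy_targets, k=1))
print(recall_at_k(toy_scores, toy_targets, k=2))
```

In the toy data, the second query ranks its target second, so it counts as a hit at K=2 but not at K=1.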

Abstract

Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather…

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques