From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos
Animesh Gupta, Jay Parmar, Ishan Rajendrakumar Dave, Mubarak Shah

TL;DR
This paper introduces TF-CoVR, a large-scale benchmark for temporally fine-grained composed video retrieval, and proposes a two-stage training framework that improves retrieval accuracy on gymnastics and diving videos in both zero-shot and fine-tuned settings.
Contribution
The paper presents TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained composed video retrieval, and a training method, TF-CoVR-Base, that improves model performance on this task.
Findings
The TF-CoVR benchmark contains 180K triplets drawn from the FineGym and FineDiving datasets (gymnastics and diving videos).
TF-CoVR-Base improves zero-shot mAP@50 from 5.92 to 7.51.
Fine-tuning with TF-CoVR-Base raises state-of-the-art mAP@50 from 19.83 to 27.22.
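The mAP@50 figures above summarize ranking quality when a query can have several valid targets. As a rough illustration only, the following sketch computes mean average precision at a cutoff K under one common definition (precision averaged over the relevant hits in the top K, normalized by min(number of relevant items, K)). The function names and the normalization choice are assumptions made for this example; the paper's exact evaluation protocol may differ, and the reported values look like percentages, so the result would be multiplied by 100 to compare.

```python
from typing import List, Sequence, Set


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 50) -> float:
    """AP@K for one query: mean of precision values at each relevant hit in the top K.

    Normalized by min(len(relevant), k); this is an assumed convention,
    not necessarily the one used in the paper.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(len(relevant), k)


def mean_average_precision_at_k(
    rankings: List[Sequence[str]], relevants: List[Set[str]], k: int = 50
) -> float:
    """mAP@K: the average of AP@K over all queries."""
    scores = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two queries, the first with two valid targets.
rankings = [["a", "b", "c", "d"], ["x", "y"]]
relevants = [{"a", "c"}, {"y"}]
map_at_50 = mean_average_precision_at_k(rankings, relevants, k=50)
```

In the toy data, the first query finds its targets at ranks 1 and 3, giving an AP of (1/1 + 2/3) / 2, and the second finds its single target at rank 2, giving 1/2; the mean of the two is the mAP.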
Abstract
Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving, and provides 180K triplets drawn from the FineGym and FineDiving datasets. Previous CoVR benchmarks that focus on the temporal aspect link each query to a single target segment taken from the same video, which limits their practical usefulness. In TF-CoVR, we instead construct each <query, modification> pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: ALIGN · Composed Video Retrieval
