From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos

Animesh Gupta; Jay Parmar; Ishan Rajendrakumar Dave; Mubarak Shah

arXiv:2506.05274 · cs.CV · November 21, 2025

TL;DR

This paper introduces TF-CoVR, a large-scale benchmark for temporally fine-grained composed video retrieval on gymnastics and diving videos, and proposes a two-stage training framework, TF-CoVR-Base, that raises fine-tuned mAP@50 from 19.83 to 27.22.

Contribution

The paper presents TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained composed video retrieval, and TF-CoVR-Base, a two-stage training framework that improves retrieval accuracy on this task.

Findings

01

The TF-CoVR benchmark contains 180K triplets drawn from gymnastics and diving videos in the FineGym and FineDiving datasets.

02

TF-CoVR-Base improves zero-shot mAP@50 from 5.92 to 7.51.

03

Fine-tuning with TF-CoVR-Base raises the state-of-the-art mAP@50 from 19.83 to 27.22.
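
The mAP@50 scores in the findings above measure ranking quality when a query can have several valid targets. This page does not spell out the exact formula, so the following is only a minimal sketch of one common convention for mean average precision at a cutoff K. The function names, the video identifiers, and the normalisation by min(K, number of valid targets) are assumptions made for illustration, not the authors' evaluation code.

```python
from typing import Sequence, Set


def average_precision_at_k(ranked_ids: Sequence[str],
                           relevant_ids: Set[str],
                           k: int = 50) -> float:
    """AP@k for one query whose valid targets are `relevant_ids`.

    Assumption: normalise by min(k, number of valid targets), which is one
    common convention; other papers normalise differently.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    # Walk the top-k results and accumulate precision at each hit position.
    for rank, video_id in enumerate(ranked_ids[:k], start=1):
        if video_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(k, len(relevant_ids))


def mean_average_precision_at_k(rankings: Sequence[Sequence[str]],
                                relevants: Sequence[Set[str]],
                                k: int = 50) -> float:
    """Mean of AP@k over all queries (returns 0.0 for an empty query set)."""
    scores = [average_precision_at_k(r, rel, k)
              for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Two toy queries with hypothetical video ids.
    rankings = [["a", "b", "c", "d"], ["a", "d"]]
    relevants = [{"a", "c"}, {"d"}]
    print(mean_average_precision_at_k(rankings, relevants, k=50))
```

The functions return a value between 0 and 1. If the reported figures are percentages, as their magnitude suggests, the result would be multiplied by 100 before comparison.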

Abstract

Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving, and provides 180K triplets drawn from FineGym and FineDiving datasets. Previous CoVR benchmarks, focusing on temporal aspect, link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each <query, modification> pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on…

Code & Models

Repositories

ucf-crcv/tf-covr
PyTorch · Official

Datasets

ucf-crcv/TF-CoVR
dataset · 81 downloads

Videos

From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: ALIGN · Composed Video Retrieval