Beyond Caption-Based Queries for Video Moment Retrieval

David Pujol-Perich; Albert Clapés; Dima Damen; Sergio Escalera; Michael Wray

arXiv:2603.02363·cs.CV·March 4, 2026

Open Access

TL;DR

This paper investigates why video moment retrieval (VMR) methods, particularly DETR-based architectures, degrade when trained on caption-based queries but evaluated on search queries. It identifies a language gap and a multi-moment gap as the key challenges, and proposes architectural modifications that increase the number of active decoder queries.

Contribution

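The authors introduce three benchmarks for evaluating generalization in VMR, built by modifying the textual queries of HD-EPIC, YouCook2 and ActivityNet-Captions. They analyze the resulting generalization challenges and propose architectural modifications to improve retrieval performance on multi-moment queries.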

Findings

01

Performance improved by up to 14.82% mAP_m on search queries.

02

Performance improved by up to 21.83% mAP_m on multi-moment queries.

03

Identified active decoder-query collapse as a primary cause of poor generalization to multi-moment queries.

Abstract

In this work, we investigate the degradation of existing VMR methods, particularly of DETR architectures, when trained on caption-based queries but evaluated on search queries. For this, we introduce three benchmarks by modifying the textual queries in three public VMR datasets -- i.e., HD-EPIC, YouCook2 and ActivityNet-Captions. Our analysis reveals two key generalization challenges: (i) a language gap, arising from the linguistic under-specification of search queries, and (ii) a multi-moment gap, caused by the shift from single-moment to multi-moment queries. We also identify a critical issue in these architectures -- an active decoder-query collapse -- as a primary cause of the poor generalization to multi-moment instances. We mitigate this issue with architectural modifications that effectively increase the number of active decoder queries. Extensive experiments demonstrate that our…
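
To make the multi-moment gap concrete, the following is a minimal sketch of how predictions that all land on one moment fail a query with several ground-truth moments. It is not the paper's mAP_m metric or evaluation code: the helper names (`temporal_iou`, `recalled_moments`), the 0.5 IoU threshold, and the example timestamps are all assumptions made for illustration.

```python
from typing import List, Tuple

Moment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recalled_moments(preds: List[Moment], gts: List[Moment], thr: float = 0.5) -> int:
    """Count ground-truth moments matched by at least one prediction at IoU >= thr."""
    return sum(any(temporal_iou(p, g) >= thr for p in preds) for g in gts)


# Hypothetical multi-moment query with three ground-truth moments.
gts = [(0.0, 10.0), (30.0, 40.0), (70.0, 80.0)]

# Predictions that all concentrate on the first moment
# (an illustrative stand-in for too few distinct predictions).
collapsed = [(0.0, 10.0), (1.0, 11.0), (0.5, 9.5)]

# Predictions that spread across the three moments.
spread = [(0.0, 9.0), (31.0, 41.0), (69.0, 80.0)]

print(recalled_moments(collapsed, gts))  # 1
print(recalled_moments(spread, gts))     # 3
```

Under these assumed inputs, the concentrated predictions recover one of the three moments while the spread predictions recover all three, which is the kind of shortfall a single-moment training regime could leave hidden.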

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning