Retrieving Any Relevant Moments: Benchmark and Models for Generalized Moment Retrieval
Yiming Ding, Siyu Cao, Luyuan Jiao, Yixuan Li, Zitong Wang, Zhiyong Liu, Lu Zhang

TL;DR
This paper introduces Generalized Moment Retrieval (GMR), a task in which a natural language query may match several video segments or none at all, together with a benchmark and baseline models for it.
Contribution
It formulates GMR as a unified task, builds the Soccer-GMR benchmark, and develops baseline models, advancing the study of realistic video moment retrieval scenarios.
Findings
GMR models outperform traditional VMR in complex scenarios.
The semi-automated, human-verified pipeline behind Soccer-GMR enables scalable data generation while maintaining high annotation quality.
Extensive experiments reveal notable limitations in current methods.
Abstract
Video Moment Retrieval (VMR) aims to localize temporal segments in videos that correspond to a natural language query, but typically assumes only a single matching moment for each query. This assumption does not always hold in real-world scenarios, where queries may correspond to multiple or no moments. Thus, we formulate Generalized Moment Retrieval (GMR), a unified setting that requires retrieving the complete set of relevant moments or predicting an empty set. To enable systematic study of GMR, we introduce Soccer-GMR, a large-scale benchmark built on challenging soccer videos that reflect general GMR scenarios, with realistic negative and positive queries. The benchmark is constructed via a duration-flexible semi-automated pipeline with human verification, enabling scalable data generation while maintaining high annotation quality. We further design a unified evaluation protocol…
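The difference between the VMR and GMR output contracts described in the abstract can be illustrated with a small sketch. It is not taken from the paper: the names `Moment`, `classify_query`, and `temporal_iou` are hypothetical, and temporal IoU is shown only as a measure commonly used in moment retrieval, not as the paper's evaluation protocol, which the truncated abstract does not spell out.

```python
from typing import List, Tuple

# Hypothetical representation: a moment is a (start_sec, end_sec) pair.
Moment = Tuple[float, float]


def classify_query(ground_truth: List[Moment]) -> str:
    """Label a query by how many moments it matches.

    Classic VMR assumes every query is "single"; GMR also allows
    "multiple" and "none" (an empty list of moments).
    """
    if not ground_truth:
        return "none"
    return "single" if len(ground_truth) == 1 else "multiple"


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection over union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


# A GMR prediction is a list of moments that may be empty.
print(classify_query([]))                          # negative query
print(classify_query([(12.0, 20.0), (95.5, 101.0)]))  # multi-moment query
print(temporal_iou((0.0, 10.0), (5.0, 15.0)))      # overlap 5s, union 15s
```

Under this representation, a model that always returns exactly one segment cannot be fully correct on "multiple" or "none" queries, which is the gap the GMR formulation targets.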
