Point to Span: Zero-Shot Moment Retrieval for Navigating Unseen Hour-Long Videos
Mingyu Jeon, Jisoo Yang, Sungjin Han, Jinkwon Hwang, Sunjae Yoon, Jonghee Kim, Junyeoung Kim

TL;DR
This paper introduces P2S, a training-free zero-shot framework for long video moment retrieval that efficiently narrows down candidates and refines results without high-cost verification, outperforming supervised methods.
Contribution
P2S is the first zero-shot approach for hour-long video temporal grounding, addressing search and refine challenges with adaptive span generation and query decomposition.
Findings
Outperforms supervised state-of-the-art by +3.7% on [email protected] on MAD
First zero-shot framework capable of hour-long video temporal grounding
Effectively reduces computational overhead in long video retrieval
Abstract
Zero-shot Long Video Moment Retrieval (ZLVMR) is the task of identifying temporal segments in hour-long videos using a natural language query without task-specific training. The core technical challenge of LVMR stems from the computational infeasibility of processing entire lengthy videos in a single pass. This limitation has established a 'Search-then-Refine' approach, where candidates are rapidly narrowed down, and only those portions are analyzed, as the dominant paradigm for LVMR. However, existing approaches to this paradigm face severe limitations. Conventional supervised learning suffers from limited scalability and poor generalization, despite substantial resource consumption. Yet, existing zero-shot methods also fail, facing a dual challenge: (1) their heuristic strategies cause a 'search' phase candidate explosion, and (2) the 'refine' phase, which is vulnerable to semantic…
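The 'Search-then-Refine' paradigm described in the abstract can be illustrated with a toy sketch. This is not the P2S algorithm, whose details are not given on this page. It assumes that per-frame embeddings and a query embedding already exist as plain lists of floats, and the names `search_then_refine`, `top_k`, `radius`, and `keep_ratio` are hypothetical. The search step scores every frame cheaply and keeps a few anchor points. The refine step looks only at the neighborhood of each anchor and grows it into a span.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_then_refine(frame_embs, query_emb, top_k=2, radius=2, keep_ratio=0.5):
    """Toy search-then-refine over precomputed embeddings (illustrative only).

    Returns a list of (start_index, end_index, anchor_score) tuples,
    one per anchor, with inclusive frame indices.
    """
    # Search: one cheap similarity score per frame, then keep the top_k anchors.
    scores = [cosine(f, query_emb) for f in frame_embs]
    anchors = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]

    spans = []
    for p in anchors:
        # Refine: inspect only frames within `radius` of the anchor, and extend
        # the span while neighbors stay above a fraction of the anchor's score.
        threshold = keep_ratio * scores[p]
        lo = hi = p
        while lo > 0 and p - lo < radius and scores[lo - 1] >= threshold:
            lo -= 1
        while hi < len(scores) - 1 and hi - p < radius and scores[hi + 1] >= threshold:
            hi += 1
        spans.append((lo, hi, scores[p]))
    return spans


if __name__ == "__main__":
    # Five hypothetical 2-D frame embeddings; frames 1 to 3 resemble the query.
    frames = [[1.0, 0.0], [0.2, 1.0], [0.0, 1.0], [0.1, 1.0], [1.0, 0.0]]
    query = [0.0, 1.0]
    print(search_then_refine(frames, query, top_k=1))
```

In this sketch the refine cost depends on `top_k` and `radius` rather than on the video length, which is the efficiency argument behind the paradigm. A real system would replace the cosine scores with a vision-language model and would need a more careful way to choose span boundaries.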
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Pose and Action Recognition
