Point to Span: Zero-Shot Moment Retrieval for Navigating Unseen Hour-Long Videos

Mingyu Jeon; Jisoo Yang; Sungjin Han; Jinkwon Hwang; Sunjae Yoon; Jonghee Kim; Junyeoung Kim

arXiv:2512.10363 · cs.CV · December 12, 2025

TL;DR

This paper introduces P2S, a training-free, zero-shot framework for long-video moment retrieval. It narrows down candidates efficiently and refines them without high-cost verification, outperforming the supervised state of the art on MAD.

Contribution

P2S is the first zero-shot approach for hour-long video temporal grounding, addressing search and refine challenges with adaptive span generation and query decomposition.

Findings

01 Outperforms the supervised state of the art by +3.7% on a recall-at-IoU (R@IoU) metric on MAD

02 First zero-shot framework capable of hour-long video temporal grounding

03 Effectively reduces computational overhead in long video retrieval

Abstract

Zero-shot Long Video Moment Retrieval (ZLVMR) is the task of identifying temporal segments in hour-long videos using a natural language query without task-specific training. The core technical challenge of LVMR stems from the computational infeasibility of processing entire lengthy videos in a single pass. This limitation has established a 'Search-then-Refine' approach, where candidates are rapidly narrowed down, and only those portions are analyzed, as the dominant paradigm for LVMR. However, existing approaches to this paradigm face severe limitations. Conventional supervised learning suffers from limited scalability and poor generalization, despite substantial resource consumption. Yet, existing zero-shot methods also fail, facing a dual challenge: (1) their heuristic strategies cause a 'search' phase candidate explosion, and (2) the 'refine' phase, which is vulnerable to semantic…
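The 'Search-then-Refine' paradigm named in the abstract can be illustrated with a minimal sketch. This is a generic illustration, not the P2S method: it assumes the video has already been split into fixed-length clips whose embeddings share a space with the query embedding, and every name here (`search`, `refine`, `search_then_refine`, `clip_seconds`, `ratio`) is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query: Sequence[float],
           clip_embs: Sequence[Sequence[float]],
           top_k: int) -> Tuple[List[int], List[float]]:
    """Search phase: score every clip cheaply and keep the top_k clip indices."""
    scores = [cosine(query, c) for c in clip_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k], scores


def refine(seed: int, scores: Sequence[float], ratio: float = 0.5) -> Tuple[int, int]:
    """Refine phase (assumed heuristic): grow a span around the seed clip
    while neighbouring clips score at least ratio * the seed's score."""
    threshold = ratio * scores[seed]
    start = end = seed
    while start > 0 and scores[start - 1] >= threshold:
        start -= 1
    while end < len(scores) - 1 and scores[end + 1] >= threshold:
        end += 1
    return start, end


def search_then_refine(query: Sequence[float],
                       clip_embs: Sequence[Sequence[float]],
                       clip_seconds: float = 2.0,
                       top_k: int = 1) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) spans, one per seed clip found by search."""
    seeds, scores = search(query, clip_embs, top_k)
    spans = []
    for seed in seeds:
        first, last = refine(seed, scores)
        spans.append((first * clip_seconds, (last + 1) * clip_seconds))
    return spans
```

Only the clips adjacent to a seed are examined in the refine step, which is the point of the paradigm: the expensive part of the analysis never touches the whole video. A real system would replace the threshold heuristic with a stronger refinement model.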

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Pose and Action Recognition