MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval

Huaying Yuan; Jian Ni; Zheng Liu; Yueze Wang; Junjie Zhou; Zhengyang Liang; Bo Zhao; Zhao Cao; Zhicheng Dou; Ji-Rong Wen

arXiv:2502.12558 · cs.CV · January 13, 2026

TL;DR

MomentSeeker introduces a comprehensive benchmark for long-video moment retrieval, featuring diverse, lengthy videos and multiple query types, to advance research in accurate and efficient long-video understanding.

Contribution

It provides a new, diverse, and challenging benchmark for long-video moment retrieval, enabling evaluation of various approaches across multiple real-world scenarios and query modalities.

Findings

01. Current models struggle with accuracy and efficiency on long videos.
02. Latest MLLMs show limited improvements in long-video retrieval.
03. The benchmark facilitates future research in long-video understanding.

Abstract

Accurately locating key moments within long videos is crucial for solving long video understanding (LVU) tasks. However, existing benchmarks are either severely limited in video length and task diversity, or they focus solely on end-to-end LVU performance, which makes them unsuitable for evaluating whether key moments can be accurately accessed. To address this challenge, we propose MomentSeeker, a novel benchmark for long-video moment retrieval (LVMR), distinguished by the following features. First, it is built on long and diverse videos, averaging over 1200 seconds in duration and collected from various domains, e.g., movie, anomaly, egocentric, and sports. Second, it covers a variety of real-world scenarios at three levels (global-level, event-level, and object-level), spanning common tasks such as action recognition, object localization, and causal reasoning. Third,…
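To make "accurately accessed" concrete, the sketch below scores a predicted moment against a ground-truth moment with temporal intersection-over-union (IoU), a measure commonly used in moment retrieval. The visible part of the abstract does not state which metrics MomentSeeker uses, so the function names, the `(start, end)` span format in seconds, and the 0.5 threshold are all illustrative assumptions, not the benchmark's protocol.

```python
from typing import Sequence, Tuple

# Assumed format: a moment is a (start_sec, end_sec) pair within one video.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gold: Span) -> float:
    """Overlap of two time spans divided by the length of their union."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: Sequence[Span], golds: Sequence[Span],
                threshold: float = 0.5) -> float:
    """Fraction of queries whose top prediction reaches the IoU threshold.

    The 0.5 default is a hypothetical choice for illustration only.
    """
    if not golds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, golds))
    return hits / len(golds)


if __name__ == "__main__":
    # Two queries on a long video: one close prediction, one complete miss.
    predictions = [(10.0, 20.0), (0.0, 10.0)]
    ground_truth = [(12.0, 20.0), (20.0, 30.0)]
    print(recall_at_1(predictions, ground_truth))
```

In a video averaging over 1200 seconds, a relevant moment may cover only a small fraction of the timeline, which is why a span-level score of this kind is stricter than checking whether the right video was retrieved.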

Code & Models

Datasets

avery00/MomentSeeker
dataset · 11k downloads

Taxonomy

Topics: Video Analysis and Summarization · Advanced Vision and Imaging · Human Pose and Action Recognition

MethodsFocus