Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization

Tao Yu; Yujia Yang; Haopeng Jin; Junhao Gong; Xinlong Chen; Yuxuan Zhou; Shanbin Zhang; Jiabing Yang; Xinming Wang; Hongzhu Yi; Ping Nie; Kai Zou; Zhang Zhang; Yan Huang; Liang Wang; Yeshani; Ruiwen Tao; Jin Ma; Haijin Liang; Jinwen Luo

arXiv:2602.10159·cs.CV·February 12, 2026

TL;DR

This paper introduces RVMS-Bench, a new benchmark with diverse, real-world video samples and a hierarchical description framework, along with RACLO, an agentic reasoning framework, to improve video search and localization based on fuzzy, multi-dimensional memories.

Contribution

It presents RVMS-Bench for evaluating real-world video memory search and RACLO, an agentic framework employing abductive reasoning for more realistic video retrieval and localization.

Findings

01. Existing models underperform on real-world fuzzy memory tasks.

02. RVMS-Bench covers diverse categories and durations from open-web videos.

03. RACLO improves search accuracy by mimicking human recall processes.

Abstract

Traditional video retrieval benchmarks focus on matching precise descriptions to closed video pools, failing to reflect real-world searches characterized by fuzzy, multi-dimensional memories on the open web. We present RVMS-Bench, a comprehensive system for evaluating real-world video memory search. It consists of 1,440 samples spanning 20 diverse categories and four duration groups, sourced from real-world open-web videos. RVMS-Bench utilizes a hierarchical description framework encompassing Global Impression, Key Moment, Temporal Context, and Auditory Memory to mimic realistic multi-dimensional search cues, with all samples strictly verified via a human-in-the-loop protocol. We further propose RACLO, an agentic framework that employs abductive reasoning to simulate the human "Recall-Search-Verify" cognitive process,…
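As a rough illustration only, the four cue types and the "Recall-Search-Verify" idea named in the abstract could be sketched as below. This is not the authors' implementation of RACLO: the `MemoryQuery` class, the `recall_search_verify` function, the `search` and `verify` callables, and the strategy of dropping the last cue after a failed round are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class MemoryQuery:
    """Hypothetical container for the four cue types named in the abstract."""
    global_impression: str = ""
    key_moment: str = ""
    temporal_context: str = ""
    auditory_memory: str = ""

    def cues(self) -> List[str]:
        # Keep only the cues the user actually remembers (non-empty strings).
        fields = (self.global_impression, self.key_moment,
                  self.temporal_context, self.auditory_memory)
        return [c for c in fields if c]


def recall_search_verify(
    query: MemoryQuery,
    search: Callable[[str], Iterable[str]],
    verify: Callable[[str, MemoryQuery], bool],
    max_rounds: int = 3,
) -> Optional[str]:
    """Toy loop: build a search string, fetch candidates, check each one.

    Assumption for this sketch: if no candidate verifies, the last cue is
    dropped and the search is retried with a looser query.
    """
    cues = query.cues()
    for _ in range(max_rounds):
        if not cues:
            break
        hypothesis = " ".join(cues)           # recall: form a search hypothesis
        for candidate in search(hypothesis):  # search: fetch candidate video ids
            if verify(candidate, query):      # verify: accept the first match
                return candidate
        cues = cues[:-1]                      # relax the query and try again
    return None
```

In practice `search` would wrap an open-web search tool and `verify` a model that checks a candidate video against the remembered cues; here both are left as plain callables so the control flow can be tested with stubs.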

Code & Models

Datasets

tencent/RVMS-Bench
dataset · 484 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization