LongVidSearch: An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos
Rongyi Yu, Chenyuan Duan, Wentao Zhang

TL;DR
LongVidSearch introduces a standardized benchmark to evaluate multi-hop evidence retrieval planning in long videos, emphasizing retrieval necessity and enabling fair comparison of agentic retrieval strategies.
Contribution
It provides a new benchmark with enforced multi-hop retrieval requirements, a standardized evidence-access interface, and comprehensive evaluation metrics for agentic long-video question answering.
Findings
GPT-5 achieves the highest accuracy, at 42.43%.
Performance drops significantly without gold evidence clips.
Retrieval planning remains the primary challenge.
Abstract
Long video question answering (Long-Video QA) increasingly relies on agentic tool use to retrieve evidence from long videos. In realistic settings, this process often requires multi-hop retrieval, where agents must iteratively gather multiple discontinuous evidence clips. However, existing long-video benchmarks are largely static: they rarely enforce strict multi-hop retrieval and typically lack a standardized evidence-access interface, making it difficult to separate failures in retrieval planning from those in answer generation. To address this gap, we introduce LongVidSearch, a benchmark for evaluating agentic multi-hop evidence retrieval planning in long videos under standardized access constraints. LongVidSearch enforces retrieval necessity: a Hop-k question requires exactly k necessary evidence clips, and removing any single clip renders the question unsolvable. The benchmark…
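The retrieval-necessity condition described in the abstract can be illustrated with a leave-one-out check. The following is a minimal sketch, not the authors' construction procedure: the function name `satisfies_retrieval_necessity`, the `is_solvable` oracle, and the clip identifiers are all hypothetical, and in practice the oracle would be a model or human judgment rather than a set comparison.

```python
from typing import Callable, FrozenSet, Iterable


def satisfies_retrieval_necessity(
    evidence_clips: Iterable[str],
    is_solvable: Callable[[FrozenSet[str]], bool],
) -> bool:
    """Leave-one-out check for a Hop-k question (hypothetical helper).

    Returns True only if the question is solvable with all clips and
    becomes unsolvable when any single clip is removed.
    """
    clips = frozenset(evidence_clips)
    # The full evidence set must be sufficient to answer the question.
    if not is_solvable(clips):
        return False
    # Every clip must be necessary: dropping any one breaks solvability.
    return all(not is_solvable(clips - {clip}) for clip in clips)


# Toy oracle (an assumption for illustration): the question is solvable
# exactly when all three required clips are available.
REQUIRED = frozenset({"clip_a", "clip_b", "clip_c"})


def toy_oracle(available: FrozenSet[str]) -> bool:
    return REQUIRED <= available


# A valid Hop-3 evidence set: each clip is necessary.
print(satisfies_retrieval_necessity(["clip_a", "clip_b", "clip_c"], toy_oracle))
# An evidence set with a redundant clip fails the check.
print(satisfies_retrieval_necessity(["clip_a", "clip_b", "clip_c", "clip_d"], toy_oracle))
```

Under this sketch, the number of clips in a passing evidence set is the hop count k, and a set containing a redundant clip is rejected because removing that clip leaves the question solvable.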
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
