LongVidSearch: An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos

Rongyi Yu; Chenyuan Duan; Wentao Zhang

arXiv:2603.14468·cs.CV·March 17, 2026

Open Access · 1 dataset

TL;DR

LongVidSearch introduces a standardized benchmark to evaluate multi-hop evidence retrieval planning in long videos, emphasizing retrieval necessity and enabling fair comparison of agentic retrieval strategies.

Contribution

The paper provides a new benchmark with enforced multi-hop retrieval requirements, a standardized evidence-access interface, and comprehensive evaluation metrics for agentic long-video question answering.

Findings

01 GPT-5 achieves highest accuracy at 42.43%.

02 Performance drops significantly without gold evidence clips.

03 Retrieval planning remains the primary challenge.

Abstract

Long video question answering (Long-Video QA) increasingly relies on agentic tool use to retrieve evidence from long videos. In realistic settings, this process often requires multi-hop retrieval, where agents must iteratively gather multiple discontinuous evidence clips. However, existing long-video benchmarks are largely static: they rarely enforce strict multi-hop retrieval and typically lack a standardized evidence-access interface, making it difficult to separate failures in retrieval planning from those in answer generation. To address this gap, we introduce LongVidSearch, a benchmark for evaluating agentic multi-hop evidence retrieval planning in long videos under standardized access constraints. LongVidSearch enforces retrieval necessity: a Hop-k question requires exactly k necessary evidence clips, and removing any single clip renders the question unsolvable. The benchmark…
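The retrieval-necessity condition stated in the abstract (a Hop-k question needs all k clips, and removing any single clip makes it unsolvable) can be illustrated with a leave-one-out check. The sketch below is not the benchmark's own validation code. The function `satisfies_retrieval_necessity`, the `is_solvable` oracle, and the clip identifiers are all hypothetical names introduced for illustration; in practice the oracle would be some judgment (human or model) of whether a question can be answered from a given clip set.

```python
from typing import Callable, Iterable, Sequence


def satisfies_retrieval_necessity(
    clips: Sequence[str],
    is_solvable: Callable[[Sequence[str]], bool],
) -> bool:
    """Leave-one-out check: the full clip set must solve the question,
    and dropping any single clip must make it unsolvable."""
    if not is_solvable(clips):
        return False
    for i in range(len(clips)):
        subset = [c for j, c in enumerate(clips) if j != i]
        if is_solvable(subset):
            # The question is still answerable without clip i,
            # so clip i is not strictly necessary.
            return False
    return True


def make_toy_oracle(required: Iterable[str]) -> Callable[[Sequence[str]], bool]:
    """Hypothetical stand-in oracle: a question counts as solvable
    only when every required clip is available."""
    required_set = set(required)
    return lambda available: required_set.issubset(available)


# A toy Hop-3 question with three candidate evidence clips (assumed IDs).
hop3_clips = ["clip_a", "clip_b", "clip_c"]

# Every clip is required, so the necessity condition holds.
strict = satisfies_retrieval_necessity(hop3_clips, make_toy_oracle(hop3_clips))

# Only two clips are actually required, so clip_c is redundant
# and the question would not qualify as Hop-3.
redundant = satisfies_retrieval_necessity(
    hop3_clips, make_toy_oracle(["clip_a", "clip_b"])
)
```

Under these assumptions, `strict` is `True` and `redundant` is `False`: the second question fails because it remains answerable after `clip_c` is removed.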

Code & Models

Datasets

Fishiing/LongVidSearch
dataset · 412 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning