Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

Chengwen Liu; Xiaomin Yu; Zhuoyue Chang; Zhe Huang; Shuo Zhang; Heng Lian; Jisheng Dang; Rui Xu; Sen Hu; Jianheng Hou; Chengwei Qin; Xiaobin Hu; Kunyi Wang; Zhi Yang; Hao Peng; Hong Peng; Ronghao Chen; Huacan Wang

arXiv:2601.06943 · cs.CV · May 19, 2026

TL;DR

VideoDR is a new benchmark for open-web video question answering that tests models' abilities in cross-frame extraction, web retrieval, and multi-hop reasoning, highlighting key challenges for future research.

Contribution

This paper introduces VideoDR, the first comprehensive benchmark for open-web video question answering involving cross-frame reasoning and web retrieval, with detailed evaluation of current models.

Findings

01 Agentic models' performance varies with their ability to maintain video anchors.

02 Goal drift and long-horizon consistency are major bottlenecks.

03 VideoDR reveals key challenges for next-generation video reasoning agents.

Abstract

In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and multi-hop reasoning-based verification. To bridge this gap, we construct the first video deep research benchmark, VideoDR. VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains. We evaluate multiple closed-source and open-source multimodal large language models under both the Workflow and Agentic paradigms, and the results show that Agentic is not…

Code & Models

Repositories

quantaalpha/VideoDR-Benchmark
github

Datasets

Yu2020/VideoDR
dataset · 237 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)