VideoExplorer: Think With Videos For Agentic Long-Video Understanding

Huaying Yuan; Zheng Liu; Junjie Zhou; Hongjin Qian; Yan Shu; Nicu Sebe; Ji-Rong Wen; Zhicheng Dou

arXiv:2506.10821 · cs.CV · November 4, 2025

TL;DR

VideoExplorer introduces an iterative, question-driven framework for long-video understanding that enhances reasoning accuracy, interpretability, and efficiency by integrating planning, temporal grounding, and perception.

Contribution

It proposes a novel reasoning framework that formulates sub-questions and locates relevant video segments iteratively, along with a new dataset and training pipeline for long-video understanding.

Findings

01. Outperforms existing methods on long-video reasoning benchmarks

02. Demonstrates robustness and adaptability across tasks

03. Achieves more faithful and interpretable reasoning processes

Abstract

Long-video understanding (LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of "thinking with video", which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset…
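
The iterative loop the abstract describes (formulate a sub-question, locate the relevant moment, perceive that segment, repeat until an answer is reached) can be sketched roughly as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `explore`, `plan`, `ground`, `perceive`, `Step`, `Trace`, the `max_steps` budget, and the toy stubs are all hypothetical, and the real system presumably uses learned models where this sketch uses plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


@dataclass
class Step:
    sub_question: str
    segment: Segment
    observation: str


@dataclass
class Trace:
    steps: List[Step] = field(default_factory=list)
    answer: Optional[str] = None


def explore(
    question: str,
    plan: Callable[[str, List[Step]], Tuple[Optional[str], Optional[str]]],
    ground: Callable[[str], Segment],
    perceive: Callable[[Segment, str], str],
    max_steps: int = 5,  # assumed step budget, not taken from the paper
) -> Trace:
    """Alternate planning, temporal grounding, and perception until answered."""
    trace = Trace()
    for _ in range(max_steps):
        # Planner returns (next sub-question, None) or (None, final answer).
        sub_q, answer = plan(question, trace.steps)
        if answer is not None:
            trace.answer = answer
            break
        segment = ground(sub_q)                 # locate the relevant moment
        observation = perceive(segment, sub_q)  # inspect only that segment
        trace.steps.append(Step(sub_q, segment, observation))
    return trace


# Toy stand-ins (hypothetical) so the loop can be run end to end.
def toy_plan(question, steps):
    if steps:  # one observation is enough for this toy example
        return None, steps[-1].observation
    return "When does the water start boiling?", None


def toy_ground(sub_q):
    return (120.0, 135.0)


def toy_perceive(segment, sub_q):
    return "salt is added after the water boils"


trace = explore(
    "What does the chef do after the water boils?",
    toy_plan, toy_ground, toy_perceive,
)
```

Keeping every sub-question, segment, and observation in the returned `Trace` is one simple way to make the reasoning inspectable after the fact, which is in the spirit of the interpretability the abstract emphasizes.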

Code & Models

Repositories

yhy-2000/videodeepresearch
PyTorch · Official

Datasets

avery00/VideoExplorer-Dataset
dataset · 58 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Robot Manipulation and Learning