LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
Jihao Qiu, Lingxi Xie, Xinyue Huo, Qi Tian, Qixiang Ye

TL;DR
This paper introduces LongVideo-R1, a multimodal large language model agent that navigates long videos efficiently by reasoning over high-level visual cues, which reduces redundant processing and improves question-answering accuracy under low computational budgets.
Contribution
The paper presents a reasoning-equipped multimodal LLM agent for efficient long video understanding, built on hierarchical video captions and a two-stage training paradigm that includes reinforcement learning.
Findings
LongVideo-R1 achieves a superior tradeoff between QA accuracy and efficiency.
The model effectively infers informative video clips using high-level visual cues.
Experiments validate the model's ability to reduce redundant video processing.
Abstract
This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, immediately halting the exploration process upon acquiring sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool…
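The inference procedure described in the abstract (start from top-level summaries, narrow the focus step by step, stop once enough is known) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Clip` structure, the `keyword_overlap` score, and the `is_sufficient` callback are hypothetical stand-ins for the hierarchical captions and the MLLM reasoning module, and the greedy top-down descent is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Clip:
    """Hypothetical node in a caption hierarchy: a clip and its sub-clips."""
    start: float
    end: float
    caption: str
    children: List["Clip"] = field(default_factory=list)


def keyword_overlap(question: str, caption: str) -> int:
    """Toy relevance score (assumption): number of shared lowercase words."""
    return len(set(question.lower().split()) & set(caption.lower().split()))


def navigate(
    root: Clip,
    question: str,
    is_sufficient: Callable[[str, Clip], bool],
    score: Callable[[str, str], int] = keyword_overlap,
    max_steps: int = 8,
) -> Tuple[Clip, List[Clip]]:
    """Greedy descent from the top-level summary with early halting.

    Returns the clip where the search stopped and every clip visited.
    """
    node = root
    visited = [node]
    for _ in range(max_steps):
        # Halt as soon as the current clip answers the query,
        # or when there is nothing finer to look at.
        if is_sufficient(question, node) or not node.children:
            break
        # Otherwise move to the sub-clip whose caption looks most relevant.
        node = max(node.children, key=lambda c: score(question, c.caption))
        visited.append(node)
    return node, visited


# Toy hierarchy: 1 summary, 2 segments, 4 leaf clips (7 clips in total).
video = Clip(0, 3600, "a cooking show with two segments", [
    Clip(0, 1800, "the host prepares pasta and sauce", [
        Clip(0, 900, "the host boils pasta in a pot"),
        Clip(900, 1800, "the host adds red tomato sauce to the pasta"),
    ]),
    Clip(1800, 3600, "the host bakes a cake", [
        Clip(1800, 2700, "the host mixes flour and eggs"),
        Clip(2700, 3600, "the host decorates the cake"),
    ]),
])


def mentions_red(question: str, clip: Clip) -> bool:
    """Stand-in for the agent deciding it has enough evidence."""
    return "red" in clip.caption.split()


final, visited = navigate(video, "what color is the sauce", mentions_red)
```

In this toy run the search visits 3 of the 7 clips before halting, which is the kind of saving over exhaustive search that the abstract is concerned with. A real agent would replace both the score and the stopping test with model reasoning.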
