LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

Jihao Qiu; Lingxi Xie; Xinyue Huo; Qi Tian; Qixiang Ye

arXiv:2602.20913 · cs.CV · April 16, 2026

1 Repo · 2 Models · 1 Dataset

TL;DR

This paper introduces LongVideo-R1, a multimodal large language model agent that efficiently navigates long videos by reasoning with high-level cues, reducing redundant processing and improving question-answering accuracy under low computational budgets.

Contribution

The paper presents a novel reasoning-equipped multimodal LLM agent for efficient long video understanding, leveraging hierarchical captions and a two-stage training paradigm including reinforcement learning.

Findings

01. LongVideo-R1 achieves a superior tradeoff between QA accuracy and efficiency.

02. The model effectively infers informative video clips using high-level visual cues.

03. Experiments validate the model's ability to reduce redundant video processing.

Abstract

This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, immediately halting the exploration process upon acquiring sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool…
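The coarse-to-fine traversal described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Clip` tree, the `keyword_overlap` scorer, the `is_sufficient` callback, and the `max_steps` budget are all hypothetical stand-ins, and a simple word-overlap heuristic replaces the MLLM reasoning module. It only shows the control flow: start at the top-level summary, descend into the most promising child clip, and stop as soon as the current clip is judged sufficient to answer the question.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Clip:
    """Hypothetical node in a caption hierarchy: a time span and its caption."""
    start: float
    end: float
    caption: str
    children: List["Clip"] = field(default_factory=list)


def keyword_overlap(question: str, caption: str) -> int:
    """Toy stand-in for the reasoning module: count shared lowercase words."""
    return len(set(question.lower().split()) & set(caption.lower().split()))


def navigate(
    root: Clip,
    question: str,
    is_sufficient: Callable[[Clip], bool],
    max_steps: int = 8,
) -> Tuple[Clip, List[Clip]]:
    """Descend from the top-level summary, halting once a clip is sufficient.

    Returns the final clip and the list of clips visited along the way.
    """
    node = root
    visited = [root]
    for _ in range(max_steps):
        # Halt early if the current clip already answers the question,
        # or if there is nothing finer-grained to inspect.
        if is_sufficient(node) or not node.children:
            break
        # Pick the child whose caption looks most relevant (assumed heuristic).
        node = max(node.children, key=lambda c: keyword_overlap(question, c.caption))
        visited.append(node)
    return node, visited


# Illustrative data only; captions and time spans are made up.
video = Clip(0, 600, "a cooking show with several dishes", [
    Clip(0, 300, "the chef prepares a salad", [
        Clip(0, 150, "washing lettuce"),
        Clip(150, 300, "adding dressing"),
    ]),
    Clip(300, 600, "the chef bakes a cake", [
        Clip(300, 450, "the chef mixes flour and eggs"),
        Clip(450, 600, "the cake comes out of the oven"),
    ]),
])

found, path = navigate(
    video,
    "when does the cake come out of the oven",
    is_sufficient=lambda clip: "oven" in clip.caption,
)
print(f"answer clip: {found.start}-{found.end}s after visiting {len(path)} clips")
```

In this toy example only three of the seven clips are visited, which is the kind of saving over exhaustive search that the abstract attributes to guided navigation.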

Code & Models

Repositories

qiujihao19/LongVideo-R1
github

Models

Datasets

ChurchillQAQ/LongVideo-R1-Data
dataset · 49 downloads
