Scaling RL to Long Videos

Yukang Chen; Wei Huang; Baifeng Shi; Qinghao Hu; Hanrong Ye; Ligeng Zhu; Zhijian Liu; Pavlo Molchanov; Jan Kautz; Xiaojuan Qi; Sifei Liu; Hongxu Yin; Yao Lu; Song Han

arXiv:2507.07966 · cs.CV · October 1, 2025

TL;DR

This paper presents a full-stack framework for scaling reinforcement learning to long videos in vision-language models. The framework combines a large-scale dataset, a two-stage training pipeline, and an efficient training infrastructure, and it reports state-of-the-art results with support for diverse modalities.

Contribution

The paper introduces the LongVideo-Reason dataset, a two-stage training pipeline that combines chain-of-thought supervised fine-tuning with RL, and MR-SP, a new training infrastructure for long video reasoning in VLMs.

Findings

01. Achieves 65.1% accuracy on VideoMME without subtitles.

02. Supports processing of up to 8,192 video frames per video.

03. The MR-SP system speeds up long video RL training by 2.1x.

Abstract

We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on…
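The abstract mentions cached video embeddings for efficient rollout and prefilling. As a rough illustration of that idea only, the sketch below encodes each video once and reuses the result across several rollouts for the same video. It is not the authors' MR-SP implementation: `EmbeddingCache`, `toy_encoder`, and `run_rollouts` are hypothetical names, and the encoder is a cheap stand-in for a real vision encoder.

```python
from typing import Callable, Dict, List


class EmbeddingCache:
    """Hypothetical cache: encode each video once, reuse it across rollouts."""

    def __init__(self, encoder: Callable[[str], List[float]]) -> None:
        self._encoder = encoder
        self._store: Dict[str, List[float]] = {}
        self.encoder_calls = 0  # counts how often the expensive encoder ran

    def get(self, video_id: str) -> List[float]:
        # Run the encoder only on a cache miss.
        if video_id not in self._store:
            self._store[video_id] = self._encoder(video_id)
            self.encoder_calls += 1
        return self._store[video_id]


def toy_encoder(video_id: str) -> List[float]:
    # Stand-in for an expensive vision encoder (assumption, not a real model).
    return [float(len(video_id)), float(sum(map(ord, video_id)) % 97)]


def run_rollouts(cache: EmbeddingCache, video_id: str, num_rollouts: int) -> List[str]:
    # Each rollout fetches the same cached embedding instead of re-encoding.
    outputs = []
    for i in range(num_rollouts):
        embedding = cache.get(video_id)
        outputs.append(f"rollout-{i}:dim={len(embedding)}")
    return outputs


cache = EmbeddingCache(toy_encoder)
rollouts = run_rollouts(cache, "video_001", num_rollouts=8)
print(cache.encoder_calls, len(rollouts))
```

In this toy setup, eight rollouts trigger a single encoder call, which is the kind of saving that caching is meant to provide when RL samples many responses per video.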

Code & Models

Repositories

hiyouga/easyr1 · pytorch · Official

Videos

Scaling RL to Long Videos · slideslive

Taxonomy

Topics: Video Analysis and Summarization · Advanced Vision and Imaging