Scaling RL to Long Videos
Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han

TL;DR
This paper presents a comprehensive framework for scaling reinforcement learning to long videos in vision-language models, including a large dataset, a novel training pipeline, and an efficient training infrastructure, achieving state-of-the-art results and supporting diverse modalities.
Contribution
The paper introduces the LongVideo-Reason dataset, a two-stage training pipeline that combines chain-of-thought supervised fine-tuning with RL, and MR-SP, a new training infrastructure for long video reasoning in VLMs.
Findings
LongVILA-R1-7B achieves 65.1% accuracy on VideoMME without subtitles.
The model supports processing of up to 8,192 frames per video.
The MR-SP system speeds up long video RL training by 2.1x.
Abstract
We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on…
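The abstract's point about cached video embeddings can be illustrated with a toy sketch. This is not the MR-SP implementation: `EmbeddingCache`, `toy_encoder`, and `sample_rollouts` are hypothetical names, and the encoder is a stand-in for a real vision encoder. The sketch only shows the general idea that, when several rollouts are sampled for the same video, the expensive encoding step can run once and be reused.

```python
from typing import Callable, Dict, List


class EmbeddingCache:
    """Hypothetical cache: encode each video once, reuse it across rollouts."""

    def __init__(self, encoder: Callable[[str], List[float]]) -> None:
        self._encoder = encoder
        self._store: Dict[str, List[float]] = {}
        self.encoder_calls = 0  # counts how often the encoder actually runs

    def get(self, video_id: str) -> List[float]:
        # Run the encoder only on a cache miss.
        if video_id not in self._store:
            self.encoder_calls += 1
            self._store[video_id] = self._encoder(video_id)
        return self._store[video_id]


def toy_encoder(video_id: str) -> List[float]:
    # Stand-in for a vision encoder (assumption): returns a 2-dim "embedding".
    return [float(len(video_id)), float(sum(map(ord, video_id)) % 97)]


def sample_rollouts(cache: EmbeddingCache, video_id: str, num_rollouts: int) -> List[str]:
    # Each rollout fetches the same cached embedding instead of re-encoding.
    rollouts = []
    for i in range(num_rollouts):
        embedding = cache.get(video_id)
        rollouts.append(f"rollout-{i}:dim={len(embedding)}")
    return rollouts


cache = EmbeddingCache(toy_encoder)
outputs = sample_rollouts(cache, "video_0001", num_rollouts=8)
print(cache.encoder_calls, len(outputs))
```

In this toy setting, eight rollouts trigger a single encoder call. How the actual system shares embeddings across sequence-parallel workers and the vLLM-based engine is not described in the text above.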
Taxonomy
Topics: Video Analysis and Summarization · Advanced Vision and Imaging
