TL;DR
EasyVideoR1 is a reinforcement learning framework for training large vision-language models on diverse video understanding tasks, built to raise training throughput and make evaluation reproducible.
Contribution
It introduces a complete, optimized pipeline with task-aware rewards, hybrid data training, and multi-benchmark evaluation tailored for video RL.
Findings
1. Achieves a 1.47× throughput improvement through offline preprocessing and tensor caching.
2. Supports 11 video/image problem types with unified reward routing.
3. Reproduces benchmark scores closely aligned with official results.
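To make the "unified reward routing" idea in finding 2 concrete, here is a minimal sketch of dispatching a verifiable reward by problem type. It is an illustration only, not the EasyVideoR1 API: the names `REWARD_ROUTER`, `compute_reward`, `exact_match_reward`, and `numeric_reward`, and the two problem-type labels, are all hypothetical, and the real framework covers 11 problem types with its own reward definitions.

```python
from typing import Callable, Dict

# Hypothetical signature: (model prediction, ground-truth answer) -> reward.
RewardFn = Callable[[str, str], float]


def exact_match_reward(prediction: str, target: str) -> float:
    """Return 1.0 when the answers match, ignoring case and outer whitespace."""
    return 1.0 if prediction.strip().lower() == target.strip().lower() else 0.0


def numeric_reward(prediction: str, target: str, tol: float = 1e-6) -> float:
    """Return 1.0 when both strings parse as floats within `tol` of each other."""
    try:
        return 1.0 if abs(float(prediction) - float(target)) <= tol else 0.0
    except ValueError:
        # Unparseable predictions earn no reward instead of crashing training.
        return 0.0


# Assumed problem-type labels; a real setup would register one entry per task.
REWARD_ROUTER: Dict[str, RewardFn] = {
    "multiple_choice": exact_match_reward,
    "numerical": numeric_reward,
}


def compute_reward(problem_type: str, prediction: str, target: str) -> float:
    """Look up the reward function for this problem type and apply it."""
    try:
        reward_fn = REWARD_ROUTER[problem_type]
    except KeyError:
        raise ValueError(f"unsupported problem type: {problem_type!r}")
    return reward_fn(prediction, target)
```

A single lookup table keeps the training loop independent of the task mix: adding a new problem type means registering one more function, and an unknown type fails loudly rather than silently scoring zero.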
Abstract
Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored for video modality. In this work, we present EasyVideoR1, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video…
