TL;DR
DeepVideo-R1 is a Video Large Language Model trained with a reinforcement learning approach that combines regression-based advantage prediction with difficulty-aware data augmentation to improve video reasoning.
Contribution
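It proposes Reg-GRPO, a reformulation of the GRPO loss as a regression task that directly predicts the advantage, and a difficulty-aware data augmentation strategy. Together these target two issues the authors identify when training VideoLLMs with GRPO: reliance on safeguards and vanishing advantage.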
Findings
Significant improvement on video reasoning benchmarks
Effective mitigation of the vanishing advantage problem
Enhanced diversity in reward signals
Abstract
Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training for enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success using a PPO-style reinforcement learning algorithm with group-normalized rewards. However, the effectiveness of GRPO in Video Large Language Models (VideoLLMs) remains underexplored. In this paper, we explore GRPO and identify two issues that hinder effective learning: (1) reliance on safeguards, and (2) vanishing advantage. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with Reg-GRPO (Regressive GRPO) and difficulty-aware data augmentation. Reg-GRPO reformulates the GRPO loss function as a regression task that directly predicts the advantage in GRPO, eliminating the need for safeguards…
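To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows the standard GRPO group-normalized advantage, how that advantage vanishes when every response in a group receives the same reward, and a plain squared-error loss as a stand-in for "regressing onto the advantage". The function names, the `eps` constant, the use of the population standard deviation, and the `predicted` values are all assumptions made for illustration; the abstract does not specify how Reg-GRPO derives its predicted advantage from the policy.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses (GRPO-style).

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def regression_loss(predicted, targets):
    """Mean squared error between predicted and target advantages.

    Illustrative stand-in only: how the model produces `predicted`
    is not described in the abstract and is left hypothetical here.
    """
    return mean([(p - t) ** 2 for p, t in zip(predicted, targets)])


# A group with mixed outcomes yields informative (non-zero) advantages.
mixed = group_normalized_advantages([1.0, 0.0, 1.0, 0.0])

# A group where every response gets the same reward yields all-zero
# advantages, so the sample contributes no learning signal.
uniform = group_normalized_advantages([1.0, 1.0, 1.0, 1.0])

# Hypothetical predicted advantages regressed onto the group targets.
loss = regression_loss([0.5, -0.5, 0.5, -0.5], mixed)
```

The `uniform` case illustrates why samples that are too easy or too hard give no gradient under group normalization, which is the situation difficulty-aware augmentation is meant to avoid by diversifying the rewards within a group.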