DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO

Jinyoung Park; Jeehye Na; Jinyoung Kim; Hyunwoo J. Kim

arXiv:2506.07464 · cs.CV · February 5, 2026

TL;DR

DeepVideo-R1 introduces a novel reinforcement learning approach for Video Large Language Models, using regression-based advantage prediction and difficulty-aware augmentation to enhance reasoning capabilities.

Contribution

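The paper proposes Reg-GRPO, a reformulation of the GRPO loss as a regression task that directly predicts the advantage, together with a difficulty-aware data augmentation strategy. These target the two issues the authors identify in GRPO training of VideoLLMs: reliance on safeguards and vanishing advantage.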

Findings

01 Significant improvement on video reasoning benchmarks

02 Effective mitigation of the vanishing-advantage problem

03 Enhanced diversity in reward signals

Abstract

Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training for enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success using a PPO-style reinforcement learning algorithm with group-normalized rewards. However, the effectiveness of GRPO in Video Large Language Models (VideoLLMs) remains underexplored. In this paper, we explore GRPO and identify two issues that hinder effective learning: (1) reliance on safeguards, and (2) vanishing advantage. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with Reg-GRPO (Regressive GRPO) and difficulty-aware data augmentation. Reg-GRPO reformulates the GRPO loss function as a regression task that directly predicts the advantage in GRPO, eliminating the need for safeguards…
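To make the two ingredients named in the abstract concrete, the following is a minimal sketch of group-normalized rewards and of a regression loss against them. It is not the authors' implementation: the function names, the `eps` stabilizer, and the use of plain mean squared error are assumptions made for illustration, and the abstract does not say how the policy's predicted advantage is computed, so it is taken here as a given input.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    `eps` is an assumed stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def regression_loss(predicted, target):
    """Mean squared error between predicted and target advantages.

    Illustrative stand-in for a regression-style objective; the exact
    Reg-GRPO loss is not given in the abstract.
    """
    return mean([(p - t) ** 2 for p, t in zip(predicted, target)])


# A group with mixed outcomes yields informative, non-zero advantages.
mixed = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response earns the same reward yields all-zero
# advantages, so the sample contributes no learning signal.
uniform = group_normalized_advantages([1.0, 1.0, 1.0, 1.0])
```

The `uniform` case illustrates the vanishing-advantage issue: when a question is too easy or too hard for the model, all responses in the group receive the same reward and the normalized advantages collapse to zero.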
