Rethinking Reward Signals in Video GRPO: When Scores Become Targets
Rui Li, Yuanzhi Liang, Ziqi Ni, Haibing Huang, Chi Zhang, Xuelong Li

TL;DR
This paper introduces TaRoS, a reward signaling framework for Video GRPO that keeps reward scores faithful to true video quality under sustained optimization, leading to more reliable and effective video generation.
Contribution
It proposes TaRoS, which uses component-level assessment and adaptive downweighting to improve reward robustness and prevent reward hacking in Video GRPO.
Findings
Improved visual fidelity in generated videos
Enhanced motion coherence and text-video alignment
Reduced reward saturation and shortcut optimization
Abstract
Group Relative Policy Optimization (GRPO) enables stable and preference-oriented updates via group-wise comparisons for post-training video generation. However, GRPO directly optimizes reward-induced advantages. Under sustained optimization, the reward score can lose fidelity as a proxy for true video quality, consistent with the phenomenon described by Goodhart's Law. This leads to two recurring issues: (i) shortcut-driven optimization under composite objectives and (ii) reward saturation within prompt groups. To address these issues, we introduce TaRoS, a Target-Robust Reward Signaling framework for Video generation GRPO. TaRoS leverages component-level performance assessment together with intra-group sparsity to organize multi-aspect rewards towards optimization objectives. In addition, it adaptively downweights components that exhibit saturation, thereby preserving effective…
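The saturation issue described in the abstract can be illustrated with a small sketch. The abstract does not give TaRoS's actual formulas, so everything below is an assumption made for illustration: the helper names (`group_advantages`, `saturation_weights`, `combine`) are hypothetical, and weighting each reward component by its intra-group standard deviation is a stand-in heuristic, not the authors' rule. The sketch only shows why a component whose scores are nearly identical within a prompt group contributes little learning signal, and how downweighting it shifts the group-relative advantage toward components that still discriminate between samples.

```python
from statistics import mean, pstdev

EPS = 1e-6  # avoids division by zero when a group has no spread


def group_advantages(rewards):
    """GRPO-style group normalization: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + EPS) for r in rewards]


def saturation_weights(component_rewards):
    """ASSUMED heuristic (not from the paper): weight each component by its
    share of the total intra-group spread, so saturated components shrink."""
    spreads = {name: pstdev(vals) for name, vals in component_rewards.items()}
    total = sum(spreads.values())
    if total == 0:
        # Every component is flat: fall back to uniform weights.
        return {name: 1.0 / len(spreads) for name in spreads}
    return {name: s / total for name, s in spreads.items()}


def combine(component_rewards, weights):
    """Weighted sum of component rewards for each sample in the group."""
    n = len(next(iter(component_rewards.values())))
    return [
        sum(weights[name] * vals[i] for name, vals in component_rewards.items())
        for i in range(n)
    ]


# One prompt group of four sampled videos (made-up scores for illustration).
group = {
    "visual": [0.98, 0.99, 0.98, 0.99],  # nearly saturated within the group
    "motion": [0.20, 0.60, 0.40, 0.80],  # still separates the samples
}
weights = saturation_weights(group)
advantages = group_advantages(combine(group, weights))
print(weights)
print(advantages)
```

With these made-up scores, the nearly flat `visual` component receives a much smaller weight than `motion`, so the resulting advantages are driven mainly by the component that still distinguishes samples. A fully saturated group (all rewards equal) yields zero advantages under this normalization, which is the loss of effective signal the abstract refers to.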
Taxonomy
Topics: Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
