Video Models Can Reason with Verifiable Rewards
Tinghui Zhu, Sheng Zhang, James Y. Huang, Selena Song, Xiaofei Wen, Yuankai Li, Hoifung Poon, Muhao Chen

TL;DR
VideoRLVR is a reinforcement learning recipe that optimizes video diffusion models with rule-based, verifiable rewards, so that generated videos satisfy explicit spatial, temporal, or logical constraints rather than merely looking plausible.
Contribution
The paper presents a recipe with three components: an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy that restricts policy optimization to the early denoising phase for efficient training.
Findings
VideoRLVR outperforms supervised baselines on Maze, FlowFree, and Sokoban.
Dense decomposed rewards improve performance in low-success-rate scenarios.
The approach surpasses proprietary and open-source models on reasoning benchmarks.
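One way to see why dense decomposed rewards matter when success is rare is the small sketch below. It is an illustration under stated assumptions, not the paper's implementation: the four maze rules, their equal weighting, and the names decomposed_maze_reward and group_relative_advantages are hypothetical, and the group normalization is the generic GRPO-style form rather than the exact SDE-GRPO objective.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]


def decomposed_maze_reward(
    path: List[Cell], walls: Set[Cell], start: Cell, goal: Cell
) -> Dict[str, float]:
    """Score a decoded maze trajectory rule by rule (hypothetical rules).

    Each component lies in [0, 1]; "total" is their unweighted mean.
    """
    if not path:
        return {"start": 0.0, "continuity": 0.0, "no_wall": 0.0,
                "goal": 0.0, "total": 0.0}
    moves = list(zip(path, path[1:]))
    # Fraction of moves that advance exactly one cell (Manhattan distance 1).
    continuity = (
        sum(abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1 for a, b in moves) / len(moves)
        if moves else 1.0
    )
    parts = {
        "start": float(path[0] == start),
        "continuity": continuity,
        # Fraction of visited cells that are not walls.
        "no_wall": sum(cell not in walls for cell in path) / len(path),
        "goal": float(path[-1] == goal),
    }
    parts["total"] = sum(parts.values()) / len(parts)
    return parts


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group (generic GRPO-style form)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


walls = {(1, 0)}
through_wall = [(0, 0), (1, 0), (1, 1)]   # reaches the goal but crosses a wall
stops_short = [(0, 0), (0, 1)]            # legal moves, never reaches the goal
dense = [
    decomposed_maze_reward(p, walls, (0, 0), (1, 1))["total"]
    for p in (through_wall, stops_short)
]
binary = [0.0, 0.0]  # an all-or-nothing check fails both samples

dense_adv = group_relative_advantages(dense)
binary_adv = group_relative_advantages(binary)
```

Under an all-or-nothing success check both failed samples score zero, so the group-normalized advantages are all zero and the group contributes no learning signal. The decomposed score ranks the two failures differently, which yields non-zero advantages even when no sample in the group fully succeeds.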
Abstract
Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about…
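The Early-Step Focus idea, restricting policy optimization to the early denoising phase, can be sketched as a simple step mask. This is a minimal illustration under assumptions: the names early_step_indices and masked_policy_loss, the focus_fraction parameter, and the convention that index 0 is the first (noisiest) denoising step are all hypothetical, and the source does not state which fraction of steps the method actually uses.

```python
import math
from typing import List


def early_step_indices(num_steps: int, focus_fraction: float) -> List[int]:
    """Return indices of the early denoising steps that receive policy updates.

    Assumes index 0 is the first denoising step; focus_fraction is a
    hypothetical knob, not a value taken from the paper.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0.0 < focus_fraction <= 1.0:
        raise ValueError("focus_fraction must be in (0, 1]")
    k = max(1, math.ceil(focus_fraction * num_steps))
    return list(range(k))


def masked_policy_loss(per_step_losses: List[float], focus_fraction: float) -> float:
    """Average the per-step losses over the early steps only.

    Later steps are still needed to produce the final video for scoring,
    but they contribute no loss term in this sketch.
    """
    idx = early_step_indices(len(per_step_losses), focus_fraction)
    return sum(per_step_losses[i] for i in idx) / len(idx)


selected = early_step_indices(8, 0.25)
loss = masked_policy_loss([4.0, 2.0, 9.0, 9.0], 0.5)
```

Because only the selected steps contribute loss terms, fewer steps need gradient computation per sample, which is the general mechanism by which such a mask can cut training cost.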
