Wan-R1: Verifiable-Reinforcement Learning for Video Reasoning
Ming Liu, Yunbei Zhang, Shilong Liu, Liwen Wang, and Wensheng Zhang

TL;DR
This paper introduces verifiable reward functions for reinforcement learning in video reasoning tasks, improving generalization and training stability in maze-solving and robotic navigation.
Contribution
The paper systematically studies reward design in RL for video reasoning and proposes verifiable rewards that are more robust and generalize better than multimodal reward models.
Findings
Verifiable rewards improve exact match accuracy by 29.1% in maze tasks.
Verifiable rewards enhance trap-avoidance performance by 51.4%.
Multimodal reward models can cause degenerate solutions, unlike verifiable rewards.
Abstract
Video generation models produce visually coherent content but struggle with tasks requiring spatial reasoning and multi-step planning. Reinforcement learning (RL) offers a path to improve generalization, but its effectiveness in video reasoning hinges on reward design -- a challenge that has received little systematic study. We investigate this problem by adapting Group Relative Policy Optimization (GRPO) to flow-based video models and training them on maze-solving and robotic navigation tasks. We first show that multimodal reward models fail catastrophically in this setting. To address this, we design verifiable reward functions grounded in objective task metrics. For structured game environments, we introduce a multi-component trajectory reward. For robotic navigation, we propose an embedding-level verifiable reward. Our experiments show that RL fine-tuning with verifiable rewards…
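As a rough illustration of the two ideas named in the abstract, the sketch below pairs a toy verifiable reward with the group-relative advantage normalization that GRPO is built on. It is not the paper's implementation: the function names, the binary reward rule, the grid-cell path representation, and the use of the population standard deviation are all assumptions made for this example, and the paper's multi-component trajectory reward is not reproduced here.

```python
from statistics import mean, pstdev


def toy_maze_reward(path, goal, traps):
    """Hypothetical verifiable reward for one decoded maze trajectory.

    path  -- list of (row, col) cells visited, assumed already extracted
             from a generated video (the extraction step is not shown)
    goal  -- target (row, col) cell
    traps -- set of (row, col) cells that must be avoided
    """
    if not path:
        return 0.0
    if any(cell in traps for cell in path):
        return 0.0  # the trajectory touched a trap
    return 1.0 if path[-1] == goal else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each sample's advantage is its reward minus the group mean, divided by
    the group standard deviation (population std is an assumption here).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for the same maze.
goal = (2, 2)
traps = {(1, 1)}
paths = [
    [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)],  # reaches the goal
    [(0, 0), (1, 1), (2, 2)],                  # crosses a trap
    [(0, 0), (0, 1)],                          # stops short of the goal
    [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)],  # reaches the goal
]
rewards = [toy_maze_reward(p, goal, traps) for p in paths]
advantages = group_relative_advantages(rewards)
```

Because the reward is computed by checking the trajectory against the maze itself, it cannot be satisfied by a visually plausible but incorrect rollout, which is the property the abstract contrasts with multimodal reward models. When every rollout in a group receives the same reward, all advantages are zero and that group contributes no learning signal.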
