World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

Weijie Wang; Xiaoxuan He; Youping Gu; Yifan Yang; Zeyu Zhang; Yefei He; Yanbo Ding; Xirui Hu; Donny Y. Chen; Zhiyuan He; Yuqing Yang; Bohan Zhuang

arXiv:2604.24764 · cs.CV · May 21, 2026

TL;DR

World-R1 is a reinforcement learning framework that improves 3D geometric consistency in text-to-video generation without modifying the model architecture, using a specialized text-only dataset and feedback from pre-trained 3D foundation models and vision-language models.

Contribution

It introduces a reinforcement learning approach, paired with a specialized pure-text dataset and a periodic decoupled training strategy, to improve 3D consistency in video synthesis.

Findings

01. Significantly improves 3D structural coherence in generated videos.

02. Maintains original visual quality of the foundation model.

03. Balances geometric consistency with scene fluidity effectively.

Abstract

Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D…
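As a rough illustration of the kind of reward shaping the abstract describes, the sketch below combines two per-video scores and normalizes them within a group of samples drawn for the same prompt, which is the group-relative advantage idea behind GRPO-style methods. It is not the authors' implementation: the function names, the equal weights, the use of population standard deviation, and the example scores are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def combine_rewards(geometry_scores, vlm_scores, w_geometry=0.5, w_vlm=0.5):
    """Weighted sum of two per-sample reward signals.

    The weights are hypothetical; the source does not state how the
    3D-model feedback and the VLM feedback are combined.
    """
    if len(geometry_scores) != len(vlm_scores):
        raise ValueError("both reward lists must cover the same samples")
    return [w_geometry * g + w_vlm * v for g, v in zip(geometry_scores, vlm_scores)]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of samples (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards a zero spread
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical videos sampled for the same text prompt.
geometry = [0.9, 0.4, 0.7, 0.2]  # made-up 3D-consistency scores
vlm = [0.8, 0.6, 0.5, 0.3]       # made-up vision-language scores
advantages = group_relative_advantages(combine_rewards(geometry, vlm))
```

Samples scoring above the group mean receive positive advantages and those below receive negative ones, so the policy update favors the more consistent videos within each group without needing a learned value function.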

Code & Models

Repositories

microsoft/World-R1
github

Datasets

microsoft/World-R1
dataset · 686 dl