Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu, Wentao Zhang, Bin Zhao

TL;DR
This paper systematically investigates reinforcement learning for text-to-3D generation, addressing reward design, algorithm choices, and benchmarks, leading to the development of the first RL-enhanced text-to-3D model with hierarchical optimization.
Contribution
The paper presents a comprehensive study of RL in 3D generation, proposes new reward and algorithm strategies, and develops AR3D-R1, the first RL-based text-to-3D model.
Findings
Alignment with human preferences is crucial for reward design.
Token-level optimization improves RL effectiveness.
The new MME-3DR benchmark measures implicit reasoning in 3D models.
Abstract
Reinforcement learning (RL), already proven effective in large language and multi-modal models, has recently been extended successfully to 2D image generation. However, applying RL to 3D generation remains largely unexplored because of the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation highly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide a robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization,…
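The abstract mentions GRPO variants and token-level optimization without giving details in this excerpt. As a rough illustration only, the sketch below contrasts the two usual ways of aggregating a GRPO-style objective over a group of samples drawn for one prompt: averaging per sample first (sequence-level) versus averaging over all tokens at once (token-level). It is a simplified surrogate that assumes a plain log-probability-times-advantage objective, with no ratio clipping and no KL term; the function names and the toy numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_level_objective(token_logprobs, advantages):
    """Average tokens within each sample, then average over samples.

    Every sample carries equal weight regardless of its length.
    """
    per_sample = [a * mean(lp) for lp, a in zip(token_logprobs, advantages)]
    return mean(per_sample)


def token_level_objective(token_logprobs, advantages):
    """Average over all tokens in the group at once.

    Every token carries equal weight, so longer samples contribute more.
    """
    total = sum(a * x for lp, a in zip(token_logprobs, advantages) for x in lp)
    n_tokens = sum(len(lp) for lp in token_logprobs)
    return total / n_tokens


# Toy group of two samples with different lengths (hypothetical values).
rewards = [1.0, 0.0]
token_logprobs = [[-1.0, -1.0], [-1.0, -1.0, -1.0, -1.0]]
advantages = group_relative_advantages(rewards)
seq_obj = sequence_level_objective(token_logprobs, advantages)
tok_obj = token_level_objective(token_logprobs, advantages)
```

With samples of equal length the two aggregations coincide; they diverge only when lengths differ, which is where the choice of token-level versus sequence-level weighting starts to matter.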
Taxonomy
Topics: 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
