Scaling Sim-to-Real Reinforcement Learning for Robot VLAs with Generative 3D Worlds
Andrew Choi, Xinjie Wang, Zhizhong Su, Wei Xu

TL;DR
This paper demonstrates that using generative 3D world models for simulation enables scalable, diverse training of vision-language-action models in robotics, improving sim-to-real transfer and generalization.
Contribution
The authors introduce a method leveraging 3D generative models and a language-driven scene designer to efficiently generate diverse training scenes for RL fine-tuning of VLAs.
Findings
Simulation success rate increased from 9.7% to 79.8%.
Real-world success rate improved from 21.7% to 75%.
Scene diversity enhances zero-shot generalization.
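To put the reported success rates in perspective, the short sketch below computes the absolute gain (in percentage points) and the relative improvement factor for each setting. The four rates come from the findings above; the dictionary layout and the function name `summarize_gain` are illustrative assumptions, not anything defined by the paper.

```python
# Success rates (percent) as reported in the findings above.
# The structure and names here are hypothetical, chosen for illustration only.
REPORTED = {
    "simulation": {"before": 9.7, "after": 79.8},
    "real_world": {"before": 21.7, "after": 75.0},
}


def summarize_gain(before: float, after: float) -> dict:
    """Return the absolute gain in percentage points and the relative factor."""
    return {
        "points": after - before,   # absolute gain, percentage points
        "factor": after / before,   # how many times higher the final rate is
    }


gains = {name: summarize_gain(**rates) for name, rates in REPORTED.items()}

for name, g in gains.items():
    print(f"{name}: +{g['points']:.1f} points, {g['factor']:.2f}x")
```

Reporting both numbers is a deliberate choice: a low starting rate inflates the relative factor, so the percentage-point gain is usually the safer figure to compare across the simulated and real-world settings.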
Abstract
The strong performance of large vision-language models (VLMs) trained with reinforcement learning (RL) has motivated similar approaches for fine-tuning vision-language-action (VLA) models in robotics. Many recent works fine-tune VLAs directly in the real world to avoid addressing the sim-to-real gap. While real-world RL circumvents sim-to-real issues, it inherently limits the generality of the resulting VLA, as scaling scene and object diversity in the physical world is prohibitively difficult. This leads to the paradoxical outcome of transforming a broadly pretrained model into an overfitted, scene-specific policy. Training in simulation can instead provide access to diverse scenes, but designing those scenes is also costly. In this work, we show that VLAs can be RL fine-tuned without sacrificing generality and with reduced labor by leveraging 3D world generative models. Using these…
