3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds

Fan-Yun Sun; Shengguang Wu; Christian Jacobsen; Thomas Yim; Haoming Zou; Alex Zook; Shangru Li; Yu-Hsin Chou; Ethem Can; Xunlei Wu; Clemens Eppner; Valts Blukis; Jonathan Tremblay; Jiajun Wu; Stan Birchfield; Nick Haber

arXiv:2507.06484 · cs.GR · August 21, 2025

TL;DR

This paper introduces 3D-Generalist, a scalable framework that uses self-improving vision-language models to generate high-quality 3D environments. The generated environments serve as training data for improving spatial reasoning in foundation models, automating the labor-intensive manual world-building seen in applications such as VR, gaming, and robotics.

Contribution

The paper recasts 3D environment creation as a sequential decision-making problem and uses self-improving vision-language models as policies to automate high-quality 3D environment generation.
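The sequential-decision framing can be illustrated with a toy rollout loop. This is only a sketch under stated assumptions, not the authors' implementation: the names `Action`, `SceneState`, `apply_action`, `rollout`, and `round_robin_policy` are hypothetical, the four action kinds are taken from the aspects listed in the abstract (layout, materials, lighting, assets), and the policy here is a trivial stand-in for a vision-language model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical action kinds, named after the four aspects the abstract lists.
ACTION_KINDS = ("layout", "materials", "lighting", "assets")


@dataclass
class Action:
    kind: str                # one of ACTION_KINDS
    params: Dict[str, str]   # free-form edit description (illustrative only)


@dataclass
class SceneState:
    prompt: str                                        # text prompt the scene should match
    edits: List[Action] = field(default_factory=list)  # actions applied so far


# A policy maps the current state to the next action.
# In the paper this role is played by a VLM; here it is any callable.
Policy = Callable[[SceneState], Action]


def apply_action(state: SceneState, action: Action) -> SceneState:
    """Return a new state with the action appended (no real 3D engine involved)."""
    if action.kind not in ACTION_KINDS:
        raise ValueError(f"unknown action kind: {action.kind}")
    return SceneState(prompt=state.prompt, edits=state.edits + [action])


def rollout(policy: Policy, prompt: str, max_steps: int) -> SceneState:
    """Build a scene step by step: query the policy, apply its action, repeat."""
    state = SceneState(prompt=prompt)
    for _ in range(max_steps):
        state = apply_action(state, policy(state))
    return state


def round_robin_policy(state: SceneState) -> Action:
    """Stand-in policy that cycles through the action kinds in a fixed order."""
    kind = ACTION_KINDS[len(state.edits) % len(ACTION_KINDS)]
    return Action(kind=kind, params={"note": f"placeholder {kind} edit"})


final = rollout(round_robin_policy, "a cozy bedroom", max_steps=4)
print([a.kind for a in final.edits])
```

With the stand-in policy, four steps produce one edit of each kind in the order given by `ACTION_KINDS`. Swapping `round_robin_policy` for a model-backed callable would leave the loop unchanged, which is the point of treating scene construction as a policy rollout.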

Findings

01. Pretraining vision models on the generated 3D data improves downstream task performance.

02. Models pretrained on this generated data surpass models trained on human-crafted synthetic data.

03. Their results approach those of models trained on much larger real datasets.

Abstract

Despite large-scale pretraining endowing models with language and vision reasoning capabilities, improving their spatial reasoning capability remains challenging due to the lack of data grounded in the 3D world. While it is possible for humans to manually create immersive and interactive worlds through 3D graphics, as seen in applications such as VR, gaming, and robotics, this process remains highly labor-intensive. In this paper, we propose a scalable method for generating high-quality 3D environments that can serve as training data for foundation models. We recast 3D environment building as a sequential decision-making problem, employing Vision-Language-Models (VLMs) as policies that output actions to jointly craft a 3D environment's layout, materials, lighting, and assets. Our proposed framework, 3D-Generalist, trains VLMs to generate more prompt-aligned 3D environments via…

Taxonomy

Topics: Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization