Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
Zhenggang Tang, Yuehao Wang, Yuchen Fan, Jun-Kun Chen, Yu-Ying Yeh, Kihyuk Sohn, Zhangyang Wang, Qixing Huang, Alexander Schwing, Rakesh Ranjan, Dilin Wang, Zhicheng Yan

TL;DR
This paper introduces a novel autoregressive diffusion model for sequential text-to-3D scene generation, enabling detailed and semantically consistent indoor scene creation from textual descriptions.
Contribution
The paper proposes 3D-ARD+, a unified generative model that combines autoregressive modeling over a multimodal token sequence with diffusion-based generation of each next object's 3D latents, with the aim of more realistic and text-consistent scene synthesis.
Findings
The model can generate complex indoor scenes with non-trivial spatial arrangements.
It outperforms existing methods in generating semantically consistent 3D scenes.
A large dataset of 230K indoor scenes was curated for training.
Abstract
Recent text-to-scene generation approaches have largely reduced the manual effort required to create 3D scenes. However, they focus on generating either a scene layout or the objects, and few generate both. The generated scene layout is often simple, even with the help of LLMs. Moreover, the generated scene is often inconsistent with text input that contains non-trivial descriptions of the shape, appearance, and spatial arrangement of the objects. We present a new paradigm of sequential text-to-scene generation and propose a novel generative model for interactive scene creation. At the core is a 3D Autoregressive Diffusion model, 3D-ARD+, which unifies autoregressive generation over a multimodal token sequence and diffusion generation of next-object 3D latents. To generate the next object, the model uses one autoregressive step to generate the coarse-grained 3D latents in the…
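The sequential paradigm described in the abstract can be illustrated with a toy control-flow sketch. This is not the authors' implementation: the function names, LATENT_DIM, DENOISE_STEPS, and both update rules are hypothetical placeholders, and the "denoiser" is a simple interpolation rather than a real diffusion sampler. It only shows the loop structure, in which each new object is conditioned on the text and on all previously generated objects, and its latent is then refined from noise over several steps.

```python
import random

LATENT_DIM = 4       # hypothetical size of a per-object latent
DENOISE_STEPS = 10   # hypothetical number of refinement steps


def autoregressive_step(sequence, prompt):
    """Placeholder for the autoregressive model: fold the prompt and all
    previously generated object latents into one conditioning vector."""
    cond = [float(len(prompt) % 7)] * LATENT_DIM
    for latent in sequence:
        cond = [c + 0.1 * x for c, x in zip(cond, latent)]
    return cond


def denoise(cond, rng):
    """Placeholder for the diffusion sampler: start from Gaussian noise and
    pull the latent toward the conditioning vector at every step."""
    latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    for step in range(DENOISE_STEPS):
        weight = 1.0 / (DENOISE_STEPS - step)  # equals 1.0 on the final step
        latent = [x + weight * (c - x) for x, c in zip(latent, cond)]
    return latent


def generate_scene(prompts, seed=0):
    """Generate one object latent per prompt, in order, so that each object
    can depend on everything generated before it."""
    rng = random.Random(seed)
    sequence = []
    for prompt in prompts:
        cond = autoregressive_step(sequence, prompt)
        sequence.append(denoise(cond, rng))
    return sequence


if __name__ == "__main__":
    scene = generate_scene(["a bed", "a lamp next to the bed"])
    print(len(scene), "objects generated")
```

Because the sequence grows one object at a time, a user could in principle inspect the scene or add a new instruction between iterations, which is the sense in which such a loop supports interactive creation.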
