Compositional 3D Scene Generation using Locally Conditioned Diffusion
Ryan Po, Gordon Wetzstein

TL;DR
This paper presents locally conditioned diffusion, a method for compositional 3D scene generation that lets users control scene components with text prompts and bounding boxes, blends those components with seamless transitions, and achieves higher fidelity than relevant baselines.
Contribution
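The paper introduces locally conditioned diffusion for scene-level 3D generation, extending text-to-3D synthesis beyond single objects. Semantic parts are controlled with text prompts and bounding boxes, and the score distillation sampling-based pipeline reaches higher fidelity than relevant baselines.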
Findings
Higher fidelity scene generation compared to baselines
Effective control over semantic parts with text and bounding boxes
Seamless transitions between scene components
Abstract
Designing complex 3D scenes has been a tedious, manual process requiring domain expertise. Emerging text-to-3D generative models show great promise for making this task more intuitive, but existing approaches are limited to object-level generation. We introduce locally conditioned diffusion as an approach to compositional scene diffusion, providing control over semantic parts using text prompts and bounding boxes while ensuring seamless transitions between these parts. We demonstrate a score distillation sampling-based text-to-3D synthesis pipeline that enables compositional 3D scene generation at a higher fidelity than relevant baselines.
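To make the idea of conditioning on text prompts and bounding boxes more concrete, here is a minimal sketch of one way masked composition of noise predictions could work. It is not the authors' implementation. It assumes each bounding box becomes a binary mask, that the masks together cover the whole image, and that the noise prediction at each pixel comes from the prompt whose mask owns that pixel. The names boxes_to_masks, compose_noise, and toy_predict_noise are hypothetical, the prompts are made up for illustration, and the toy predictor merely stands in for a real text-conditioned diffusion model.

```python
import numpy as np


def boxes_to_masks(boxes, height, width):
    """Convert (x0, y0, x1, y1) pixel boxes into binary masks.

    Where boxes overlap, the later box owns the pixel, so the masks
    never overlap each other.
    """
    owner = np.full((height, width), -1, dtype=np.int64)
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        owner[y0:y1, x0:x1] = i
    masks = np.zeros((len(boxes), height, width), dtype=np.float32)
    for i in range(len(boxes)):
        masks[i] = owner == i
    return masks


def compose_noise(x_t, t, prompts, masks, predict_noise):
    """Sum per-prompt noise predictions, each restricted to its own mask."""
    eps = np.zeros_like(x_t)
    for prompt, mask in zip(prompts, masks):
        # mask has shape (H, W) and broadcasts over the channel axis of x_t.
        eps += mask * predict_noise(x_t, t, prompt)
    return eps


def toy_predict_noise(x_t, t, prompt):
    """Hypothetical stand-in for a diffusion model: constant per prompt."""
    return np.full_like(x_t, float(len(prompt)))


# Two boxes that split a 4x8 image into left and right halves.
boxes = [(0, 0, 4, 4), (4, 0, 8, 4)]
prompts = ["a lighthouse", "a beach"]  # illustrative prompts only
masks = boxes_to_masks(boxes, height=4, width=8)

x_t = np.zeros((3, 4, 8), dtype=np.float32)  # noisy image, channels first
eps = compose_noise(x_t, 0, prompts, masks, toy_predict_noise)
```

In this sketch every denoising step sees the full noisy image x_t, and only the output is masked. That is one plausible reason region boundaries could blend smoothly instead of producing hard seams, though the abstract does not spell out the mechanism.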
Taxonomy
Topics: Image Processing and 3D Reconstruction · Human Motion and Animation · Computer Graphics and Visualization Techniques
