Ctrl-Room: Controllable Text-to-3D Room Meshes Generation with Layout Constraints
Chuan Fang, Yuan Dong, Kunming Luo, Xiaotao Hu, Rakesh Shrestha, Ping Tan

TL;DR
Ctrl-Room is a novel method that generates high-quality, editable 3D indoor rooms from text prompts by separating layout and appearance modeling, enabling flexible editing and realistic scene creation.
Contribution
The paper introduces a two-stage approach combining layout and appearance generation, with scene code parameterization for easy editing, advancing controllable text-to-3D room synthesis.
Findings
Outperforms existing methods in layout reasonableness and view consistency
Enables flexible editing of generated rooms without retraining
Produces high-fidelity textures and convincing room layouts
Abstract
Text-driven 3D indoor scene generation is useful for gaming, the film industry, and AR/VR applications. However, existing methods cannot faithfully capture the room layout, nor do they allow flexible editing of individual objects in the room. To address these problems, we present Ctrl-Room, which can generate convincing 3D rooms with designer-style layouts and high-fidelity textures from just a text prompt. Moreover, Ctrl-Room enables versatile interactive editing operations such as resizing or moving individual furniture items. Our key insight is to separate the modeling of layouts and appearance. Our proposed method consists of two stages: a Layout Generation Stage and an Appearance Generation Stage. The Layout Generation Stage trains a text-conditional diffusion model to learn the layout distribution with our holistic scene code parameterization. Next, the Appearance Generation Stage…
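The abstract describes editing as resizing or moving individual furniture items in a parameterized layout. A minimal sketch of that idea follows, assuming each item is stored as an oriented 3D box. The `FurnitureBox` class and the `move`, `resize`, and `edit_layout` helpers are hypothetical names for illustration and are not taken from the paper's scene code definition.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class FurnitureBox:
    """Hypothetical layout entry: one furniture item as an oriented 3D box."""
    label: str
    center: Vec3   # box center in room coordinates (meters, assumed)
    size: Vec3     # box extents along its local axes (meters, assumed)
    yaw: float     # rotation about the vertical axis (radians, assumed)


def move(box: FurnitureBox, offset: Vec3) -> FurnitureBox:
    """Return a copy of the box translated by the given offset."""
    cx, cy, cz = box.center
    dx, dy, dz = offset
    return replace(box, center=(cx + dx, cy + dy, cz + dz))


def resize(box: FurnitureBox, scale: float) -> FurnitureBox:
    """Return a copy of the box with all extents scaled uniformly."""
    sx, sy, sz = box.size
    return replace(box, size=(sx * scale, sy * scale, sz * scale))


def edit_layout(layout: List[FurnitureBox], index: int,
                edited: FurnitureBox) -> List[FurnitureBox]:
    """Return a new layout in which only the item at `index` is replaced."""
    return [edited if i == index else box for i, box in enumerate(layout)]


# Example layout with two items; only the bed is edited.
layout = [
    FurnitureBox("bed", (1.0, 2.0, 0.25), (2.0, 1.5, 0.5), 0.0),
    FurnitureBox("table", (3.0, 1.0, 0.5), (1.0, 1.0, 1.0), 0.0),
]
moved_bed = move(layout[0], (0.5, 0.0, 0.0))
small_bed = resize(moved_bed, 0.5)
new_layout = edit_layout(layout, 0, small_bed)
```

Because the edit touches only the layout entries, the appearance stage could in principle be re-run on the edited layout without retraining, which is the behavior the findings above claim. How the real system encodes walls, doors, and windows is not covered by this sketch.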
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1. The experimental results are impressive. 2. The two-stage design eases the editing of the generated 3D indoor scenes.
Lack of technical novelty. Although the two-stage design has its own advantages, it is more a design strategy than a solid technical contribution, since such designs are widely used in 3D content generation. For instance, Visual Object Networks first generate geometry through a shape network and then generate rendering results through a texture network.
1. The proposed method can locally control the 3D room generation, which can generate plausible indoor scenes. 2. The proposed method separates the layout generation and appearance generation. 3. The proposed method can achieve 3D indoor scene editing.
1. There are two diffusion models, which leads to both high computation cost and high GPU memory cost. 2. Although the rendered views seem better, the geometry seems worse than MVDiffusion's, based on Figure 5.
(+) The intuitive approach of generating the layout first and then the appearance allows for fine-grained control over the generated scenes, enabling easy human intervention in editing the layout. (+) The method, to some extent, avoids problems faced by existing methods, such as generating multiple beds in the same room, ensuring more realistic scene generation. (+) The paper introduces the concept of loop consistency sampling, ensuring that the generated panoramic images maintain their cycli
(-) The method heavily relies on 3D bounding box annotations. Given the scarcity of datasets with such 3D annotations, the generalization capability of the text-to-layout process is limited. The experiments are restricted to generating only living rooms and bedrooms, without exploring the generation of other room types. (-) As mentioned in the appendix, the current approach can only generate textures for a single panoramic image. It doesn't support multi-viewpoint generation, which limits the v
1. Adding controls to scene-scale 3D generation is a promising direction, and this paper makes some of the first attempts to address it. The whole pipeline is reasonable and sound. 2. The editing experiment is impressive and interesting.
1. The variety of generation. Since both the scene generator and the ControlNet are trained on the Structured3D dataset, which contains only "living room" and "bedroom", I am concerned about the variety of scenes this method can generate. I am also curious how it works on other room types, such as "bathroom" and "kitchen". 2. Text prompts are strange. The text prompts shown in this paper are not very natural. For example, " the living room has eight walls,. The room has a picture, a shelves and a
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Human Motion and Animation
Methods: Diffusion
