HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Model
Hieu T. Nguyen, Yiwen Chen, Vikram Voleti, Varun Jampani, Huaizu Jiang

TL;DR
HouseCrafter leverages a 2D diffusion model trained on web images to generate consistent multi-view RGB-D images from floorplans, enabling the reconstruction of detailed large-scale 3D indoor scenes.
Contribution
This work adapts a 2D diffusion model for multi-view indoor scene generation from floorplans, introducing a novel autoregressive and attention-based approach for 3D scene synthesis.
Findings
High-quality 3D scenes generated from floorplans
Effective multi-view consistency across generated images
Validated on 3D-Front dataset with ablation studies
Abstract
We introduce HouseCrafter, a novel approach that can lift a floorplan into a complete large 3D indoor scene (e.g., a house). Our key insight is to adapt a 2D diffusion model, which is trained on web-scale images, to generate consistent multi-view color (RGB) and depth (D) images across different locations of the scene. Specifically, the RGB-D images are generated autoregressively in a batch-wise manner along sampled locations based on the floorplan, where previously generated images are used as conditions for the diffusion model to produce images at nearby locations. The global floorplan and attention design in the diffusion model ensure the consistency of the generated images, from which a 3D scene can be reconstructed. Through extensive evaluation on the 3D-Front dataset, we demonstrate that HouseCrafter can generate high-quality house-scale 3D scenes. Ablation studies also validate the…
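The batch-wise autoregressive schedule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `generate_scene_views`, `generate_batch`, `fake_model`, `batch_size`, and `num_cond` are hypothetical, the camera locations are assumed to have been sampled from the floorplan already, and the diffusion model is replaced by a stub that only records how many conditioning views it received. The nearest-neighbor ordering is one plausible way to pick "nearby" locations, not necessarily the one used in the paper.

```python
import math
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]  # a 2D camera location on the floorplan


def nearest(target: Point, pool: List[Point], k: int) -> List[Point]:
    """Return up to k points in pool closest to target (Euclidean distance)."""
    return sorted(pool, key=lambda p: math.dist(p, target))[:k]


def generate_scene_views(
    locations: List[Point],
    generate_batch: Callable[[List[Point], List[Point]], List[str]],
    batch_size: int = 2,
    num_cond: int = 2,
) -> Dict[Point, str]:
    """Hypothetical scheduling loop; generate_batch stands in for the diffusion model."""
    remaining = list(locations)
    done: Dict[Point, str] = {}
    while remaining:
        if done:
            # Visit the locations closest to already generated views first.
            remaining.sort(key=lambda p: min(math.dist(p, q) for q in done))
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        # Condition on the generated views nearest to the batch centroid.
        cx = sum(p[0] for p in batch) / len(batch)
        cy = sum(p[1] for p in batch) / len(batch)
        cond = nearest((cx, cy), list(done), num_cond)
        for p, img in zip(batch, generate_batch(batch, cond)):
            done[p] = img
    return done


def fake_model(batch: List[Point], cond: List[Point]) -> List[str]:
    """Stub model (assumption): returns a label instead of an RGB-D image."""
    return [f"rgbd@{p}|cond={len(cond)}" for p in batch]


views = generate_scene_views(
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)], fake_model
)
```

In this sketch the first batch has no conditioning views, and every later batch is conditioned on at most `num_cond` earlier views. In the method described above, the floorplan would also be passed to the model as a global condition at every step.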
Peer Reviews
Decision: Submitted to ICLR 2025
(1) This is a well-written paper. (2) The method generates high-resolution depth images for large-scale scene reconstruction, which is more practical and meaningful in real-world scenarios. (3) The proposed method is compared with various methods, and the experiments are complete and convincing. (4) Some visualizations are helpful for understanding the method.
(1) Lack of inference time comparison.
1. Using 2D floorplans as a conditioning method requires less human intervention and manual effort, yet produces globally coherent 3D scenes. In contrast, methods like Text2Room are relatively simple but do not ensure a plausible layout in generated 3D scenes. This highlights the advantages of using 2D floorplans as a basis for 3D scene generation. 2. The proposed approach of floorplan conditioning and multi-view RGB-D conditioning may inspire new methods for extending 2D generation to 3D at the
1. The current comparisons are not entirely convincing. For example, CC3D focuses on novel view rendering at the scene level rather than geometry, and its renderings qualitatively appear superior to those in this work. Additionally, Text2Room, which generates scene layouts based on text input alone, may not provide an appropriate basis for comparison in this context. 2. Since this paper focuses on floorplan-based 3D scene generation, a more relevant comparison would be with BlockFusion [1], whic
- The paper addresses a compelling problem with potential applications in architectural design and game development. - The paper covers the literature on novel-view synthesis well. - The method is clear and well explained. - The floorplan embedding method is novel and clever. - The generated results are geometrically and visually consistent.
- The paper does not clarify why generating "house-scale" scenes is preferable to generating "room-scale" scenes. Why is consistency across the entire house necessary, rather than generating individual rooms and combining them? In reality, different rooms within a house may vary in style and are not necessarily consistent with one another. This needs to be clarified, as the authors used it as grounds to exclude comparison against ControlRoom3D and Ctrl-Room at L207. - The authors argue at L203 t
Taxonomy
Topics: Architecture and Computational Design
Methods: Softmax · Attention Is All You Need · Diffusion
