CLIP-Layout: Style-Consistent Indoor Scene Synthesis with Semantic Furniture Embedding
Jingyu Liu, Wenhan Xiong, Ian Jones, Yixin Nie, Anchit Gupta, Barlas Oğuz

TL;DR
This paper introduces CLIP-Layout, a novel indoor scene synthesis model that uses CLIP embeddings for instance-level predictions, enabling style-consistent, visually coherent, and zero-shot text-guided scene generation.
Contribution
It presents an auto-regressive model that leverages CLIP embeddings for instance-level furniture selection and placement, surpassing previous methods that relied on category labels and ignored visual attributes.
Findings
Achieves state-of-the-art results on the 3D-FRONT dataset.
Improves auto-completion metrics by over 50%.
Enables zero-shot text-guided scene editing.
Abstract
Indoor scene synthesis involves automatically picking and placing furniture appropriately on a floor plan, so that the scene looks realistic and is functionally plausible. Such scenes can serve as homes for immersive 3D experiences, or be used to train embodied agents. Existing methods for this task rely on labeled categories of furniture, e.g. bed, chair or table, to generate contextually relevant combinations of furniture. Whether heuristic or learned, these methods ignore instance-level visual attributes of objects, and as a result may produce visually less coherent scenes. In this paper, we introduce an auto-regressive scene model which can output instance-level predictions, using general purpose image embedding based on CLIP. This allows us to learn visual correspondences such as matching color and style, and produce more functionally plausible and aesthetically pleasing scenes.…
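One way to picture an instance-level prediction in an embedding space is as a nearest-neighbor lookup: the model emits a vector, and the furniture item whose image embedding lies closest to it is selected. The sketch below illustrates only that lookup, under assumptions that are not taken from the paper: the retrieval rule (cosine similarity), the function names `cosine_similarity` and `retrieve_instance`, the catalog item names, and the toy 3-dimensional vectors standing in for real CLIP embeddings are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_instance(predicted, catalog):
    """Return the id of the catalog item whose embedding is most similar to `predicted`."""
    return max(catalog, key=lambda item_id: cosine_similarity(predicted, catalog[item_id]))


# Hypothetical catalog: toy 3-d vectors standing in for CLIP image embeddings
# of three nightstands that share a category but differ in color and material.
catalog = {
    "oak_nightstand": [0.9, 0.1, 0.0],
    "white_nightstand": [0.1, 0.9, 0.0],
    "black_metal_nightstand": [0.0, 0.1, 0.9],
}

# Hypothetical model output for the next object to place in the scene.
predicted = [0.8, 0.2, 0.1]

best = retrieve_instance(predicted, catalog)
print(best)
```

A category label alone cannot separate the three nightstands in this toy catalog, whereas the embedding lookup can, which is the kind of distinction the abstract refers to as matching color and style.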
Taxonomy
Topics: Advanced Vision and Imaging · 3D Surveying and Cultural Heritage · Generative Adversarial Networks and Image Synthesis
Methods: Contrastive Language-Image Pre-training
