Hierarchically-Structured Open-Vocabulary Indoor Scene Synthesis with Pre-trained Large Language Model

Weilin Sun; Xinran Li; Manyi Li; Kai Xu; Xiangxu Meng; Lei Meng

arXiv:2502.10675·cs.CV·February 18, 2025

TL;DR

This paper introduces a hierarchical approach to indoor scene synthesis using pre-trained large language models, enabling the generation of realistic, diverse, and user-aligned 3D indoor scenes through structured descriptions and layout optimization.

Contribution

The paper proposes a hierarchy-aware network and divide-and-conquer optimization for feasible scene layout generation from LLM outputs, improving realism and generalization in open-vocabulary indoor scene synthesis.

Findings

01. Generated scene layouts are more reasonable and aligned with user requirements.

02. Hierarchical structure improves object placement consistency and layout feasibility.

03. The approach outperforms existing methods in qualitative and quantitative evaluations.

Abstract

Indoor scene synthesis aims to automatically produce plausible, realistic, and diverse 3D indoor scenes, especially given arbitrary user requirements. Recently, the promising generalization ability of pre-trained large language models (LLMs) has been used to assist open-vocabulary indoor scene synthesis. However, the challenge lies in converting the LLM-generated outputs into reasonable and physically feasible scene layouts. In this paper, we propose to generate hierarchically structured scene descriptions with an LLM and then compute the scene layouts. Specifically, we train a hierarchy-aware network to infer the fine-grained relative positions between objects and design a divide-and-conquer optimization to solve for scene layouts. The advantages of using a hierarchically structured scene representation are two-fold. First, the hierarchical structure provides a rough grounding for object arrangement, which…
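To make the idea of a hierarchically structured scene description concrete, here is a minimal Python sketch. It is not the paper's method: the `Node` class, the `resolve` function, the area and object names, and the hand-picked offsets are all hypothetical. The offsets stand in for the relative positions that the paper's hierarchy-aware network would infer, and the recursion only shows how per-group local placements compose into one room-level layout. It does not perform any optimization.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class Node:
    """One entry in a hypothetical scene hierarchy (room, area, or object)."""
    name: str
    # Position relative to the parent node; assumed given here, not learned.
    offset: Point = (0.0, 0.0)
    children: List["Node"] = field(default_factory=list)


def resolve(node: Node, origin: Point = (0.0, 0.0)) -> Dict[str, Point]:
    """Compose relative offsets down the tree into absolute (x, y) positions."""
    x = origin[0] + node.offset[0]
    y = origin[1] + node.offset[1]
    placed = {node.name: (x, y)}
    # Each subtree is placed in its parent's frame, independently of siblings.
    for child in node.children:
        placed.update(resolve(child, (x, y)))
    return placed


# Illustrative hierarchy: a room split into two functional areas.
room = Node("bedroom", (0.0, 0.0), [
    Node("sleeping_area", (1.0, 2.0), [
        Node("bed", (0.0, 0.0)),
        Node("nightstand", (1.5, 0.0)),
    ]),
    Node("work_area", (4.0, 0.5), [
        Node("desk", (0.0, 0.0)),
        Node("chair", (0.0, 0.75)),
    ]),
])

layout = resolve(room)
```

Because each group is positioned in its own local frame, moving `sleeping_area` shifts the bed and nightstand together without touching the work area, which is the kind of rough grounding a hierarchy can provide.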

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition