HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models
Haiyan Jiang, Deyu Zhang, Dongdong Weng, Weitao Song, Henry Been-Lirn Duh

TL;DR
HOG-Layout leverages large language and vision-language models to enable real-time, text-driven 3D scene generation, optimization, and editing with improved semantic consistency and physical plausibility.
Contribution
It introduces a hierarchical scene representation combined with retrieval-augmented generation and optimization modules for fast, plausible, and editable 3D scene synthesis.
Findings
Produces more reasonable environments than existing baselines.
Supports fast and intuitive scene editing.
Enhances semantic and physical consistency in 3D scenes.
Abstract
3D layout generation and editing play a crucial role in Embodied AI and immersive VR interaction. However, manual creation requires tedious labor, while data-driven generation often lacks diversity. The emergence of large models introduces new possibilities for 3D scene synthesis. We present HOG-Layout that enables text-driven hierarchical scene generation, optimization and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves scene semantic consistency and plausibility through retrieval-augmented generation (RAG) technology, incorporates an optimization module to enhance physical consistency, and adopts a hierarchical representation to enhance inference and optimization, achieving real-time editing. Experimental results demonstrate that HOG-Layout produces more reasonable environments compared with existing baselines, while…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
