HOLODECK 2.0: Vision-Language-Guided 3D World Generation with Editing
Zixuan Bian, Ruohan Ren, Yue Yang, Chris Callison-Burch

TL;DR
HOLODECK 2.0 is a vision-language-guided framework that generates and edits diverse, high-quality 3D scenes from detailed descriptions, improving automation and flexibility in 3D world creation.
Contribution
It introduces a novel interactive 3D scene generation and editing system leveraging vision-language models and state-of-the-art 3D generative models, supporting open-domain and style-rich environments.
Findings
Outperforms baselines in scene quality and semantic fidelity
Supports flexible scene editing based on human feedback
Effective in diverse styles and open-domain scenarios
Abstract
3D scene generation plays a crucial role in gaming, artistic creation, virtual reality, and many other domains. However, current 3D scene design still relies heavily on extensive manual effort from creators, and existing automated methods struggle to generate open-domain scenes or support flexible editing. To address those challenges, we introduce HOLODECK 2.0, an advanced vision-language-guided framework for 3D world generation with support for interactive scene editing based on human feedback. HOLODECK 2.0 can generate diverse and stylistically rich 3D scenes (e.g., realistic, cartoon, anime, and cyberpunk styles) that exhibit high semantic fidelity to fine-grained input descriptions, suitable for both indoor and open-domain environments. HOLODECK 2.0 leverages vision-language models (VLMs) to identify and parse the objects required in a scene and generates corresponding high-quality…
Taxonomy
Topics: Human Motion and Animation · Multimodal Machine Learning Applications
