HOLODECK 2.0: Vision-Language-Guided 3D World Generation with Editing

Zixuan Bian; Ruohan Ren; Yue Yang; Chris Callison-Burch

arXiv:2508.05899·cs.CV·December 22, 2025

Open Access

TL;DR

HOLODECK 2.0 is a vision-language-guided framework that generates and edits diverse, high-quality 3D scenes from detailed descriptions, improving automation and flexibility in 3D world creation.

Contribution

The paper introduces a novel interactive 3D scene generation and editing system that leverages vision-language models and state-of-the-art 3D generative models and supports open-domain, style-rich environments.

Findings

01 Outperforms baselines in scene quality and semantic fidelity

02 Supports flexible scene editing based on human feedback

03 Effective in diverse styles and open-domain scenarios

Abstract

3D scene generation plays a crucial role in gaming, artistic creation, virtual reality, and many other domains. However, current 3D scene design still relies heavily on extensive manual effort from creators, and existing automated methods struggle to generate open-domain scenes or support flexible editing. To address those challenges, we introduce HOLODECK 2.0, an advanced vision-language-guided framework for 3D world generation with support for interactive scene editing based on human feedback. HOLODECK 2.0 can generate diverse and stylistically rich 3D scenes (e.g., realistic, cartoon, anime, and cyberpunk styles) that exhibit high semantic fidelity to fine-grained input descriptions, suitable for both indoor and open-domain environments. HOLODECK 2.0 leverages vision-language models (VLMs) to identify and parse the objects required in a scene and generates corresponding high-quality…

Taxonomy

Topics: Human Motion and Animation · Multimodal Machine Learning Applications