TL;DR
World2Minecraft converts real-world scenes into Minecraft environments using 3D semantic occupancy prediction, giving embodied AI tasks a flexible, high-fidelity simulation environment. It is supported by a scalable data collection pipeline and a new large-scale dataset, MinecraftOcc.
Contribution
The paper introduces a novel method for scene reconstruction into Minecraft environments and presents a large-scale occupancy dataset to improve prediction accuracy and generalization.
Findings
The MinecraftOcc dataset contains 100,165 images from 156 indoor scenes.
Experiments show that MinecraftOcc improves occupancy prediction and remains challenging for current state-of-the-art (SOTA) methods.
World2Minecraft enables flexible, high-fidelity simulation for embodied AI research.
Abstract
Embodied intelligence requires high-fidelity simulation environments to support perception and decision-making, yet existing platforms often suffer from data contamination and limited flexibility. To mitigate this, we propose World2Minecraft to convert real-world scenes into structured Minecraft environments based on 3D semantic occupancy prediction. In the reconstructed scenes, we can effortlessly perform downstream tasks such as Vision-Language Navigation (VLN). However, we observe that reconstruction quality heavily depends on accurate occupancy prediction, which remains limited by data scarcity and poor generalization in existing models. We introduce a low-cost, automated, and scalable data acquisition pipeline for creating customized occupancy datasets, and demonstrate its effectiveness through MinecraftOcc, a large-scale dataset featuring 100,165 images from 156 richly detailed…
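To make the occupancy-to-Minecraft idea concrete, here is a minimal sketch of how a predicted semantic occupancy grid could be turned into block placements. It is not the paper's implementation: the class IDs, the label-to-block table, the grid axis order, and the function name `occupancy_to_setblock` are all assumptions made for illustration. Only the `setblock <x> <y> <z> <block>` command form and the block identifiers come from Minecraft itself.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical label-to-block table. The class IDs and block choices are
# illustrative assumptions, not the mapping used by World2Minecraft.
EMPTY = 0
CLASS_TO_BLOCK: Dict[int, str] = {
    1: "minecraft:oak_planks",    # e.g. floor
    2: "minecraft:stone_bricks",  # e.g. wall
    3: "minecraft:white_wool",    # e.g. furniture
}
FALLBACK_BLOCK = "minecraft:stone"  # used for labels missing from the table

# Assumed layout: grid[x][y][z] holds a semantic class ID, with z pointing up.
Grid = List[List[List[int]]]


def occupancy_to_setblock(
    grid: Grid, origin: Tuple[int, int, int] = (0, 64, 0)
) -> Iterator[str]:
    """Yield one Minecraft `setblock` command per occupied voxel."""
    ox, oy, oz = origin
    for x, plane in enumerate(grid):
        for y, column in enumerate(plane):
            for z, label in enumerate(column):
                if label == EMPTY:
                    continue  # free space: place nothing
                block = CLASS_TO_BLOCK.get(label, FALLBACK_BLOCK)
                # Minecraft's vertical axis is Y, so the grid's z maps to world Y.
                yield f"setblock {ox + x} {oy + z} {oz + y} {block}"


if __name__ == "__main__":
    demo: Grid = [[[1, 0], [2, 9]]]  # a 1 x 2 x 2 toy grid
    for command in occupancy_to_setblock(demo):
        print(command)
```

Keeping the label table separate from the traversal means that a different set of semantic classes or a different block palette only requires editing the dictionary, which is one plausible way to keep such a conversion flexible.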