TL;DR
LangScene-X introduces a generative framework that reconstructs and understands 3D scenes with open-vocabulary language embedding from sparse views, overcoming limitations of previous dense-view methods.
Contribution
LangScene-X unifies 3D reconstruction and language understanding in one generative framework, using a TriMap video diffusion model to generate consistent observations and a Language Quantized Compressor for cross-scene generalization.
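To make the idea of a quantized compressor concrete, here is a minimal sketch of nearest-codebook vector quantization in plain Python. This is an assumption based only on the word "Quantized" in the component's name: the summary does not describe the compressor's architecture or training, and the function names (`quantize`, `compress`, `decompress`) and the toy codebook are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(feature: Vector, codebook: Sequence[Vector]) -> Tuple[int, List[float]]:
    """Return the index and the vector of the codebook entry nearest to `feature`."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(feature, codebook[i]))
    return index, list(codebook[index])


def compress(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Replace each high-dimensional feature with one discrete code index."""
    return [quantize(f, codebook)[0] for f in features]


def decompress(indices: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look the code indices back up to recover approximate features."""
    return [list(codebook[i]) for i in indices]


# Hypothetical 2-D codebook; real language features would be much larger.
toy_codebook = [[0.0, 0.0], [1.0, 1.0]]
toy_indices = compress([[0.1, 0.2], [0.9, 0.8]], toy_codebook)
```

Because the codebook in this sketch is shared rather than fitted per scene, the same discrete codes can be reused across scenes, which is one plausible reading of "cross-scene generalization".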
Findings
Outperforms state-of-the-art methods in quality and generalizability
Generates consistent novel observations from sparse views
Enables open-ended language queries on 3D scenes
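An open-ended language query on a language-embedded scene can be sketched as a similarity search. The example below assumes each 3D point carries a language feature in the same embedding space as the encoded text query (as with a CLIP-style encoder); that setup, the `query_scene` helper, and the threshold value are illustrative assumptions, not the paper's stated procedure.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_scene(point_features: Sequence[Vector],
                text_embedding: Vector,
                threshold: float = 0.5) -> List[int]:
    """Return indices of points whose feature matches the text query."""
    return [i for i, feature in enumerate(point_features)
            if cosine_similarity(feature, text_embedding) >= threshold]


# Hypothetical 2-D features for three scene points and one text query.
toy_points = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.0]
toy_matches = query_scene(toy_points, toy_query)
```

In this toy scene the first and third points clear the threshold, so a query aligned with the first axis selects those two and skips the orthogonal one.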
Abstract
Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and…
