LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion

Fangfu Liu; Hao Li; Jiawei Chi; Hanyang Wang; Minghui Yang; Fudong Wang; Yueqi Duan

arXiv:2507.02813 · cs.CV · July 4, 2025

TL;DR

LangScene-X introduces a generative framework that reconstructs and understands 3D scenes with open-vocabulary language embedding from sparse views, overcoming limitations of previous dense-view methods.

Contribution

LangScene-X unifies 3D reconstruction and language understanding by combining a TriMap video diffusion model with a Language Quantized Compressor that supports cross-scene generalization.

Findings

01 Outperforms state-of-the-art methods in quality and generalizability

02 Generates consistent novel observations from sparse views

03 Enables open-ended language queries on 3D scenes

Abstract

Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and…

Code & Models

Models

chijw/LangScene-X (Hugging Face model)
