4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer

Xianfeng Wu; Yajing Bai; Minghan Li; Xianzu Wu; Xueqi Zhao; Zhongyuan Lai; Wenyu Liu; Xinggang Wang

arXiv:2512.05060·cs.CV·December 5, 2025

Open Access

TL;DR

This paper introduces 4DLangVGGT, a Transformer-based framework for 4D language grounding that effectively models dynamic scenes and generalizes across multiple scenes, outperforming prior scene-specific methods.

Contribution

The paper presents the first unified Transformer-based approach for 4D language grounding that jointly models geometry and semantics, enabling scalable and generalizable 4D scene understanding.

Findings

01. Achieves state-of-the-art results on HyperNeRF and Neu3D datasets.

02. Demonstrates effective generalization across multiple dynamic scenes.

03. Outperforms scene-specific methods with up to 2% accuracy gains.

Abstract

Constructing 4D language fields is crucial for embodied AI, augmented/virtual reality, and 4D scene understanding, as they provide enriched semantic representations of dynamic environments and enable open-vocabulary querying in complex scenarios. However, existing approaches to 4D semantic field construction primarily rely on scene-specific Gaussian splatting, which requires per-scene optimization, exhibits limited generalization, and is difficult to scale to real-world applications. To address these limitations, we propose 4DLangVGGT, the first Transformer-based feed-forward unified framework for 4D language grounding that jointly integrates geometric perception and language alignment within a single architecture. 4DLangVGGT has two key components: the 4D Visual Geometry Transformer, StreamVGGT, which captures spatio-temporal geometric representations of dynamic scenes; and the…
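To make "open-vocabulary querying" of a 4D language field concrete, here is a minimal sketch. It is not the paper's implementation. It assumes the field yields one feature vector per pixel per frame, and that a text encoder (not shown) maps the query into the same embedding space. The names `cosine`, `query_mask`, the threshold value, and the toy data are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def query_mask(feature_frames, text_embedding, threshold=0.5):
    """For each frame, mark pixels whose feature is similar enough to the query.

    feature_frames: list of frames; each frame is rows of per-pixel feature vectors.
    text_embedding: query vector, assumed to live in the same space as the features.
    threshold: hypothetical similarity cutoff, chosen only for illustration.
    """
    return [
        [[cosine(pixel, text_embedding) >= threshold for pixel in row] for row in frame]
        for frame in feature_frames
    ]

# Toy data (made up): 2 frames, each 1x2 pixels, with 3-dimensional features.
frames = [
    [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]],
    [[[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]],
]
query = [1.0, 0.0, 0.0]  # stand-in for an encoded text prompt
masks = query_mask(frames, query)
```

In this toy case the first pixel of each frame is selected and the second is not, so the mask follows the queried content across time. Real systems typically use high-dimensional features and learned or relative thresholds.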

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis