GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond

Anna-Maria Halacheva; Jan-Nico Zaech; Xi Wang; Danda Pani Paudel; Luc Van Gool

arXiv:2507.00886·cs.CV·July 2, 2025

TL;DR

GaussianVLM introduces a scene-centric 3D vision-language model that embeds linguistic features into Gaussian splat scenes, enabling efficient, task-aware representations and significantly improving out-of-domain 3D scene understanding.

Contribution

It is the first Gaussian splatting-based 3D VLM that directly integrates language into dense scene representations for improved generalization.

Findings

01 Achieves a fivefold performance improvement over prior 3D VLMs.

02 Operates effectively on photorealistic 3D representations derived from RGB images.

03 Demonstrates strong generalization in out-of-domain scenarios.

Abstract

As multimodal language models advance, their application to 3D scene understanding is a fast-growing frontier, driving the development of 3D Vision-Language Models (VLMs). Current methods show strong dependence on object detectors, introducing processing bottlenecks and limitations in taxonomic flexibility. To address these limitations, we propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. Our approach directly embeds rich linguistic features into the 3D scene representation by associating language with each Gaussian primitive, achieving early modality alignment. To process the resulting dense representations, we introduce a dual sparsifier that distills them into compact, task-relevant tokens via task-guided and location-guided pathways, producing sparse, task-aware global and local scene tokens. Notably, we present…
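The dual sparsifier described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how either pathway selects or pools Gaussians, so the sketch assumes top-k cosine similarity for the task-guided pathway and radius-based mean pooling for the location-guided pathway. All function names, parameters (`k`, `radius`), and data below are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_pool(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def task_guided_tokens(features, task_embedding, k):
    """Assumed task-guided pathway: keep the k per-Gaussian language
    features most similar to the task embedding (global scene tokens)."""
    ranked = sorted(
        range(len(features)),
        key=lambda i: cosine(features[i], task_embedding),
        reverse=True,
    )
    return [features[i] for i in ranked[:k]]

def location_guided_token(positions, features, query_xyz, radius):
    """Assumed location-guided pathway: mean-pool the features of the
    Gaussians within `radius` of a query location (local scene token)."""
    nearby = [
        f for p, f in zip(positions, features)
        if math.dist(p, query_xyz) <= radius
    ]
    return mean_pool(nearby) if nearby else None

# Toy scene: four Gaussians, each with a 3D center and a 2-D language feature.
positions = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 0.0), (5.1, 5.0, 0.0)]
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
task_embedding = [0.0, 1.0]  # hypothetical embedding of the task prompt

global_tokens = task_guided_tokens(features, task_embedding, k=2)
local_token = location_guided_token(positions, features, (0.0, 0.0, 0.0), radius=0.5)
```

The point of the sketch is only the shape of the computation: a dense set of per-Gaussian language features is reduced to a few task-relevant global tokens plus a local token tied to a query location, and no object detector is involved.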

Code & Models

Datasets

amhalacheva/GaussianVLM_results
dataset · 22 downloads
