GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting

Elena Alegret; Kunyi Li; Sen Wang; Siyun Liang; Michael Niemeyer; Stefano Gasperini; Nassir Navab; Federico Tombari

arXiv:2508.14278 · cs.CV · August 22, 2025

TL;DR

GALA introduces a novel framework that combines guided attention with language alignment to enable open-vocabulary 3D scene understanding using Gaussian Splatting, effectively capturing fine-grained, language-aware 3D representations from 2D images.

Contribution

The paper proposes a cross-attention module with learnable codebooks for encoding view-independent semantic embeddings, enhancing open-vocabulary 3D scene understanding with reduced memory usage.
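As a rough illustration of how a codebook-based cross-attention could avoid storing a high-dimensional feature on every Gaussian, the NumPy sketch below treats a low-dimensional per-Gaussian instance feature as the query, one codebook as the keys, and the other as the values. This split of roles, the function name `codebook_cross_attention`, and all sizes (`N`, `d`, `K`, `D`) are assumptions made for illustration; the summary above does not specify them, and this is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def codebook_cross_attention(instance_feats, key_codebook, value_codebook):
    """Map low-dimensional instance features to language-space features.

    instance_feats: (N, d) per-Gaussian instance features (queries)
    key_codebook:   (K, d) learnable keys (hypothetical role)
    value_codebook: (K, D) learnable language-space values (hypothetical role)
    """
    d = instance_feats.shape[1]
    scores = instance_feats @ key_codebook.T / np.sqrt(d)  # (N, K)
    weights = softmax(scores, axis=-1)                      # each row sums to 1
    return weights @ value_codebook, weights                # (N, D), (N, K)

# Hypothetical sizes: N Gaussians, d-dim instance features,
# K codebook entries, D-dim language features.
rng = np.random.default_rng(0)
N, d, K, D = 1000, 16, 64, 512
instance_feats = rng.normal(size=(N, d))
key_codebook = rng.normal(size=(K, d))
value_codebook = rng.normal(size=(K, D))

lang_feats, weights = codebook_cross_attention(
    instance_feats, key_codebook, value_codebook
)

# Floats stored with codebooks versus a dense D-dim feature per Gaussian.
compact_floats = N * d + K * d + K * D
dense_floats = N * D
print(lang_feats.shape, compact_floats, dense_floats)
```

In this sketch the output depends only on the instance feature and the shared codebooks, so two Gaussians with the same instance feature receive the same language feature, which is one way to read the intra-instance similarity claim.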

Findings

01. GALA achieves state-of-the-art open-vocabulary performance on real-world datasets.

02. GALA supports both 2D and 3D open-vocabulary queries while keeping features consistent within each instance.

03. The framework reduces memory consumption compared to previous methods by avoiding per-Gaussian high-dimensional feature learning.

Abstract

3D scene reconstruction and understanding have gained increasing popularity, yet existing methods still struggle to capture fine-grained, language-aware 3D representations from 2D images. In this paper, we present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. To extend to generalized language feature fields, we introduce the core contribution of GALA, a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings. This design not only ensures intra-instance feature similarity but also supports seamless 2D and 3D open-vocabulary queries. It reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning. Extensive experiments on real-world datasets demonstrate GALA's…
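The contrastive distillation of instance features mentioned in the abstract can be pictured with a generic mask-supervised contrastive loss: features of pixels that share a 2D mask are pulled together, and features from different masks are pushed apart. The sketch below is an assumption-laden illustration in NumPy. The loss form (an InfoNCE-style supervised contrastive loss), the `temperature` value, and the idea that mask ids come from some 2D segmenter are all hypothetical choices, not details taken from the paper.

```python
import numpy as np

def mask_contrastive_loss(feats, mask_ids, temperature=0.1):
    """Generic supervised contrastive loss over rendered pixel features.

    feats:    (P, d) features rendered at P sampled pixels
    mask_ids: (P,)   integer id of the 2D mask each pixel belongs to
    """
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T / temperature                      # (P, P)
    P = feats.shape[0]
    not_self = ~np.eye(P, dtype=bool)
    positives = (mask_ids[:, None] == mask_ids[None, :]) & not_self

    # Log-softmax over all other pixels (self-similarity excluded).
    sim = np.where(not_self, sim, -np.inf)
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(
        np.exp(sim - row_max).sum(axis=1, keepdims=True)
    )

    # Average log-probability of positives, for anchors that have any.
    has_pos = positives.any(axis=1)
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / positives.sum(axis=1)[has_pos]
    return per_anchor.mean()

# Toy check with three masks and two pixels per mask.
mask_ids = np.array([0, 0, 1, 1, 2, 2])
basis = np.eye(3)
aligned = basis[mask_ids]                   # same mask -> same feature
misaligned = basis[[0, 1, 2, 0, 1, 2]]      # same mask -> orthogonal features

loss_aligned = mask_contrastive_loss(aligned, mask_ids)
loss_misaligned = mask_contrastive_loss(misaligned, mask_ids)
print(loss_aligned < loss_misaligned)
```

The toy check only confirms the intended behavior of this generic loss: features that agree within a mask score lower than features that disagree.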

Taxonomy

Topics: Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization