Language and Geometry Grounded Sparse Voxel Representations for Holistic Scene Understanding

Guile Wu, David Huang, Bingbing Liu, and Dongfeng Bai

arXiv:2602.15734 · cs.CV · February 18, 2026

Open Access

TL;DR

This paper introduces a unified 3D scene understanding framework that models language, appearance, semantics, and geometry with sparse voxel representations, and reports better scene understanding and reconstruction results than existing methods.

Contribution

The paper proposes an approach that combines language and geometric grounding with sparse voxel representations for holistic 3D scene modeling.

Findings

01. Outperforms state-of-the-art in scene understanding and reconstruction
02. Effectively integrates language, appearance, semantics, and geometry
03. Demonstrates improved synergy among scene features

Abstract

Existing 3D open-vocabulary scene understanding methods mostly emphasize distilling language features from 2D foundation models into 3D feature fields, but largely overlook the synergy among scene appearance, semantics, and geometry. As a result, scene understanding often deviates from the underlying geometric structure of scenes and becomes decoupled from the reconstruction process. In this work, we propose a novel approach that leverages language and geometry grounded sparse voxel representations to comprehensively model appearance, semantics, and geometry within a unified framework. Specifically, we use 3D sparse voxels as primitives and employ an appearance field, a density field, a feature field, and a confidence field to holistically represent a 3D scene. To promote synergy among the appearance, density, and feature fields, we construct a feature modulation module and distill…
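
To make the representation described in the abstract more concrete, here is a minimal sketch of a sparse voxel container in which each occupied voxel carries an appearance value, a density, a feature vector, and a confidence. The class names (`Voxel`, `SparseVoxelScene`), the field types, and the hash-map layout are illustrative assumptions, not taken from the paper, which may store and optimize these fields very differently.

```python
from dataclasses import dataclass, field
from math import floor
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int, int]


@dataclass
class Voxel:
    # Hypothetical per-voxel payload. The four field names mirror the
    # abstract; the concrete types and sizes are assumptions.
    appearance: Tuple[float, float, float]  # e.g. an RGB colour
    density: float                          # geometry / occupancy strength
    feature: List[float]                    # language-aligned feature vector
    confidence: float                       # reliability of the feature


@dataclass
class SparseVoxelScene:
    # Only occupied voxels are stored, keyed by integer grid coordinates.
    voxel_size: float
    voxels: Dict[Coord, Voxel] = field(default_factory=dict)

    def key(self, point: Tuple[float, float, float]) -> Coord:
        # Map a continuous 3D point to the grid cell that contains it.
        x, y, z = point
        s = self.voxel_size
        return (floor(x / s), floor(y / s), floor(z / s))

    def insert(self, point: Tuple[float, float, float], voxel: Voxel) -> None:
        self.voxels[self.key(point)] = voxel

    def query(self, point: Tuple[float, float, float]) -> Optional[Voxel]:
        # Returns None for empty space, which is what keeps the grid sparse.
        return self.voxels.get(self.key(point))


scene = SparseVoxelScene(voxel_size=0.5)
scene.insert((1.2, 0.1, -0.3), Voxel((0.8, 0.2, 0.2), 3.0, [0.1, 0.9], 0.7))

hit = scene.query((1.4, 0.4, -0.1))   # falls in the same cell as the insert
miss = scene.query((5.0, 5.0, 5.0))   # unoccupied cell
```

Keeping all four quantities on the same primitive is one simple way to let appearance, geometry, and semantics be read from a single lookup; how the paper couples them during training is not covered by this sketch.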

Taxonomy

Topics: 3D Shape Modeling and Analysis · Face recognition and analysis · Robotics and Sensor-Based Localization