Language and Geometry Grounded Sparse Voxel Representations for Holistic Scene Understanding
Guile Wu, David Huang, Bingbing Liu, and Dongfeng Bai

TL;DR
This paper introduces a unified 3D scene understanding framework that models language, appearance, semantics, and geometry jointly on sparse voxel representations, and reports better scene understanding and reconstruction results than existing methods.
Contribution
The paper proposes language and geometry grounded sparse voxel representations that model appearance, semantics, and geometry of a 3D scene within a single framework.
Findings
Outperforms state-of-the-art methods in scene understanding and reconstruction
Integrates language, appearance, semantics, and geometry in one representation
Improves synergy among the appearance, density, and feature fields
Abstract
Existing 3D open-vocabulary scene understanding methods mostly emphasize distilling language features from 2D foundation models into 3D feature fields, but largely overlook the synergy among scene appearance, semantics, and geometry. As a result, scene understanding often deviates from the underlying geometric structure of scenes and becomes decoupled from the reconstruction process. In this work, we propose a novel approach that leverages language and geometry grounded sparse voxel representations to comprehensively model appearance, semantics, and geometry within a unified framework. Specifically, we use 3D sparse voxels as primitives and employ an appearance field, a density field, a feature field, and a confidence field to holistically represent a 3D scene. To promote synergy among the appearance, density, and feature fields, we construct a feature modulation module and distill…
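The representation described in the abstract (sparse voxels as primitives, each carrying appearance, density, feature, and confidence values) can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions, not the authors' implementation: the hash-map layout, the class and method names (`VoxelFields`, `SparseVoxelScene`, `key_for`, `insert`, `query`), the voxel size, and the field dimensions are all hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

VoxelKey = Tuple[int, int, int]


@dataclass
class VoxelFields:
    """Per-voxel values for the four fields named in the abstract.

    The shapes below are assumptions chosen for illustration only.
    """
    appearance: List[float]  # assumed: an RGB colour
    density: float           # assumed: a single scalar
    feature: List[float]     # assumed: a short language-aligned vector
    confidence: float        # assumed: a scalar in [0, 1]


@dataclass
class SparseVoxelScene:
    """Hypothetical sparse grid: only occupied voxels are stored."""
    voxel_size: float
    voxels: Dict[VoxelKey, VoxelFields] = field(default_factory=dict)

    def key_for(self, point: Tuple[float, float, float]) -> VoxelKey:
        # Map a continuous 3D point to the integer index of its voxel.
        x, y, z = point
        return (
            math.floor(x / self.voxel_size),
            math.floor(y / self.voxel_size),
            math.floor(z / self.voxel_size),
        )

    def insert(self, point: Tuple[float, float, float], fields: VoxelFields) -> None:
        # Store (or overwrite) the fields of the voxel containing `point`.
        self.voxels[self.key_for(point)] = fields

    def query(self, point: Tuple[float, float, float]) -> Optional[VoxelFields]:
        # Return the fields of the voxel containing `point`, or None if empty.
        return self.voxels.get(self.key_for(point))


scene = SparseVoxelScene(voxel_size=0.5)
scene.insert(
    (0.7, 0.1, -0.2),
    VoxelFields(appearance=[0.8, 0.2, 0.2], density=1.5,
                feature=[0.1, 0.9], confidence=0.95),
)
hit = scene.query((0.9, 0.4, -0.1))   # falls in the same voxel as the insert
miss = scene.query((5.0, 5.0, 5.0))   # empty space, nothing stored
```

Storing only occupied voxels is what makes the grid sparse: empty space costs no memory, and all four fields for a location are retrieved with a single lookup, which is one plausible way to keep appearance, geometry, and language features tied to the same primitive.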
Taxonomy
Topics: 3D Shape Modeling and Analysis · Face Recognition and Analysis · Robotics and Sensor-Based Localization
