TL;DR
ShelfGaussian is a 3D scene understanding framework that uses off-the-shelf vision foundation models (VFMs) to supervise open-vocabulary, multi-modal Gaussian representations for perception and planning.
Contribution
The paper introduces a Multi-Modal Gaussian Transformer, which lets Gaussians query features from multiple sensor modalities, and a Shelf-Supervised Learning Paradigm, which optimizes Gaussians with VFM features jointly at the 2D image and 3D scene levels.
Findings
Achieves state-of-the-art zero-shot semantic occupancy prediction on Occ3D-nuScenes.
Demonstrates in-the-wild performance in urban scenarios on unmanned ground vehicles (UGVs).
Abstract
We introduce ShelfGaussian, an open-vocabulary multi-modal Gaussian-based 3D scene understanding framework supervised by off-the-shelf vision foundation models (VFMs). Gaussian-based methods have demonstrated superior performance and computational efficiency across a wide range of scene understanding tasks. However, existing methods either model objects as closed-set semantic Gaussians supervised by annotated 3D labels, neglecting their rendering ability, or learn open-set Gaussian representations via purely 2D self-supervision, which degrades geometry and limits them to camera-only settings. To fully exploit the potential of Gaussians, we propose a Multi-Modal Gaussian Transformer that enables Gaussians to query features from diverse sensor modalities, and a Shelf-Supervised Learning Paradigm that efficiently optimizes Gaussians with VFM features jointly at 2D image and 3D scene…
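The excerpt above does not say how open-vocabulary labels are read out of the Gaussians. As an illustration only, a common pattern in open-vocabulary pipelines is to compare each per-Gaussian feature with text embeddings of candidate class names and pick the closest one by cosine similarity. The sketch below assumes that pattern; the function names, the toy 3-dimensional features, and the class list are all hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_labels(gaussian_feats, text_embeds):
    """Give each Gaussian the class whose text embedding is most similar to its feature.

    gaussian_feats: list of feature vectors, one per Gaussian (hypothetical layout).
    text_embeds: dict mapping class name -> text embedding in the same feature space.
    """
    labels = []
    for feat in gaussian_feats:
        best = max(text_embeds, key=lambda name: cosine(feat, text_embeds[name]))
        labels.append(best)
    return labels


# Toy data: in practice both sides would come from a vision-language model.
text_embeds = {
    "car": [1.0, 0.0, 0.0],
    "road": [0.0, 1.0, 0.0],
    "tree": [0.0, 0.0, 1.0],
}
gaussian_feats = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.2, 0.7],
]
labels = assign_labels(gaussian_feats, text_embeds)
```

Because the class list is only a dictionary of text embeddings, new categories can be added at query time without retraining, which is what makes this kind of readout zero-shot.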
