GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
Haoyi Jiang, Liu Liu, Tianheng Cheng, Xinjie Wang, Tianwei Lin, Zhizhong Su, Wenyu Liu, Xinggang Wang

TL;DR
GaussTR introduces a Gaussian Transformer framework that unifies sparse 3D modeling with foundation model alignment, enabling scalable, self-supervised 3D semantic understanding with zero-shot capabilities and reduced training time.
Contribution
GaussTR is the first framework to predict sparse Gaussians for 3D scenes and align them with foundation models for open-vocabulary semantic occupancy prediction.
Findings
Achieves 12.27 mIoU in the zero-shot setting on Occ3D-nuScenes
Reduces training time by 40% compared to previous methods
Demonstrates state-of-the-art performance in scalable 3D spatial understanding
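For readers unfamiliar with the metric, the mIoU figure above can be read through the following minimal sketch. It assumes predictions and labels are integer class IDs per voxel and that some voxels are excluded from scoring; the function name `mean_iou` and the `ignore_index` value of 255 are illustrative choices, not taken from the paper or the benchmark code. Benchmarks conventionally report the result multiplied by 100.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union across classes present in pred or target.

    ignore_index is a hypothetical marker for voxels excluded from scoring.
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()

    # Drop voxels that should not be scored.
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]

    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:  # skip classes absent from both pred and target
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy example: six voxels, three classes, last voxel ignored.
pred = np.array([0, 0, 1, 1, 2, 2])
target = np.array([0, 1, 1, 1, 2, 255])
miou = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {100 * miou:.2f}")
```

In the toy example the per-class IoUs are 1/2, 2/3, and 1, so the mean is 13/18.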
Abstract
3D Semantic Occupancy Prediction is fundamental for spatial understanding, yet existing approaches face challenges in scalability and generalization due to their reliance on extensive labeled data and computationally intensive voxel-wise representations. In this paper, we introduce GaussTR, a novel Gaussian-based Transformer framework that unifies sparse 3D modeling with foundation model alignment through Gaussian representations to advance 3D spatial understanding. GaussTR predicts sparse sets of Gaussians in a feed-forward manner to represent 3D scenes. By splatting the Gaussians into 2D views and aligning the rendered features with foundation models, GaussTR facilitates self-supervised 3D representation learning and enables open-vocabulary semantic occupancy prediction without requiring explicit annotations. Empirical experiments on the Occ3D-nuScenes dataset demonstrate GaussTR's…
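The alignment idea in the abstract can be illustrated with a small sketch. It does not perform Gaussian splatting and is not the authors' implementation: it assumes a rendered feature map and a foundation-model feature map of the same shape are already available, uses a cosine-similarity loss as one plausible alignment objective, and labels features by their nearest text embedding as one plausible open-vocabulary readout. All function names and array shapes here are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cosine_alignment_loss(rendered, reference):
    """1 - mean cosine similarity between two (H, W, C) feature maps.

    `rendered` stands in for features splatted from the Gaussians;
    `reference` stands in for features from a frozen foundation model.
    """
    cos = np.sum(l2_normalize(rendered) * l2_normalize(reference), axis=-1)
    return float(1.0 - cos.mean())

def open_vocab_labels(features, text_embeddings):
    """Give each (N, C) feature the index of its most similar text embedding."""
    sim = l2_normalize(features) @ l2_normalize(text_embeddings).T
    return sim.argmax(axis=-1)

rng = np.random.default_rng(0)
reference = rng.normal(size=(4, 6, 8))          # toy foundation-model features
loss_same = cosine_alignment_loss(reference, reference)       # close to 0
loss_opposite = cosine_alignment_loss(-reference, reference)  # close to 2

# Toy "text embeddings" for three category names, and three features
# that each lie near one of them.
text_embeddings = np.eye(3, 8)
gaussian_features = 2.0 * text_embeddings[[2, 0, 1]] + 0.1
labels = open_vocab_labels(gaussian_features, text_embeddings)
print(loss_same, loss_opposite, labels)
```

Because the supervision signal comes from another model's features rather than from occupancy labels, a loss of this kind needs no explicit annotations, which is the property the abstract emphasizes.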
Taxonomy
Topics: Image Processing and 3D Reconstruction · Robotics and Sensor-Based Localization · 3D Surveying and Cultural Heritage
Methods: Attention Is All You Need · Linear Layer · Dropout · Dense Connections · Byte Pair Encoding · Multi-Head Attention · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Label Smoothing
