GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding

Haoyi Jiang; Liu Liu; Tianheng Cheng; Xinjie Wang; Tianwei Lin; Zhizhong Su; Wenyu Liu; Xinggang Wang

arXiv:2412.13193 · cs.CV · March 25, 2025

TL;DR

GaussTR introduces a Gaussian Transformer framework that unifies sparse 3D modeling with foundation model alignment, enabling scalable, self-supervised 3D semantic understanding with zero-shot capabilities and reduced training time.

Contribution

GaussTR is the first framework to predict sparse Gaussians for 3D scenes and align them with foundation models for open-vocabulary semantic occupancy prediction.

Findings

01. Achieves 12.27 mIoU in the zero-shot setting on Occ3D-nuScenes.

02. Reduces training time by 40% compared to previous methods.

03. Demonstrates state-of-the-art zero-shot performance for scalable 3D spatial understanding.
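
The mIoU figure above is a mean intersection-over-union across semantic classes. As a rough illustration of how such a score is computed, here is a minimal sketch over flat per-voxel labels. The function name `mean_iou`, the toy labels, and the handling of absent classes are assumptions made for this example; the exact Occ3D-nuScenes evaluation protocol (class set, visibility masking) is not described on this page and is not reproduced here.

```python
from collections import Counter

def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean intersection-over-union over flat per-voxel class labels.

    Illustrative only: classes absent from both prediction and target
    are skipped, which is one common convention (an assumption here).
    """
    inter = Counter()         # voxels where prediction and target agree, per class
    pred_count = Counter()    # voxels predicted as each class
    target_count = Counter()  # voxels labeled as each class
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # skip voxels excluded from evaluation
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            inter[p] += 1
    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - inter[c]
        if union == 0:
            continue  # class does not occur at all; leave it out of the mean
        ious.append(inter[c] / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy example with three classes and six voxels (hypothetical data).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {100 * miou:.2f}")  # reported as a percentage, like 12.27 above
```

Per-class IoU is computed first and then averaged, so rare classes weigh as much as frequent ones.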

Abstract

3D Semantic Occupancy Prediction is fundamental for spatial understanding, yet existing approaches face challenges in scalability and generalization due to their reliance on extensive labeled data and computationally intensive voxel-wise representations. In this paper, we introduce GaussTR, a novel Gaussian-based Transformer framework that unifies sparse 3D modeling with foundation model alignment through Gaussian representations to advance 3D spatial understanding. GaussTR predicts sparse sets of Gaussians in a feed-forward manner to represent 3D scenes. By splatting the Gaussians into 2D views and aligning the rendered features with foundation models, GaussTR facilitates self-supervised 3D representation learning and enables open-vocabulary semantic occupancy prediction without requiring explicit annotations. Empirical experiments on the Occ3D-nuScenes dataset demonstrate GaussTR's…
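
The abstract describes aligning rendered 2D features with foundation-model features and then predicting open-vocabulary semantics. The sketch below illustrates that idea in plain Python under loudly stated assumptions: a cosine-based alignment loss and nearest-text-embedding classification are common choices, but this page does not say which losses or matching rules GaussTR actually uses. All function names, feature vectors, and category embeddings are hypothetical, and Gaussian splatting itself is not implemented here.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)  # epsilon guards against zero vectors

def alignment_loss(rendered, reference):
    """Mean (1 - cosine similarity) over corresponding per-pixel features.

    `rendered` stands in for features splatted from the Gaussians into a view;
    `reference` stands in for foundation-model features of the same pixels.
    The cosine form is an assumption, not the paper's stated objective.
    """
    losses = [1.0 - cosine_similarity(r, f) for r, f in zip(rendered, reference)]
    return sum(losses) / len(losses)

def classify_open_vocab(feature, text_embeddings):
    """Return the category whose (hypothetical) text embedding best matches."""
    return max(
        text_embeddings,
        key=lambda name: cosine_similarity(feature, text_embeddings[name]),
    )

# Toy features: the rendered vectors point the same way as the references,
# so the alignment loss is close to zero.
rendered = [[1.0, 0.0], [0.0, 2.0]]
reference = [[2.0, 0.0], [0.0, 1.0]]
loss = alignment_loss(rendered, reference)

# Hypothetical category embeddings in the same feature space.
text_embeddings = {"car": [1.0, 0.1], "tree": [0.1, 1.0]}
label = classify_open_vocab([0.9, 0.2], text_embeddings)
print(loss, label)
```

Because the supervision signal comes from an existing model rather than from 3D labels, new categories can be queried by adding entries to `text_embeddings` without retraining, which is the sense in which such a setup is open-vocabulary.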

Code & Models

Repositories

hustvl/gausstr
PyTorch · Official

Taxonomy

Topics: Image Processing and 3D Reconstruction · Robotics and Sensor-Based Localization · 3D Surveying and Cultural Heritage

Methods: Attention Is All You Need · Linear Layer · Dropout · Dense Connections · Byte Pair Encoding · Multi-Head Attention · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Label Smoothing