Self-Supervised and Generalizable Tokenization for CLIP-Based 3D Understanding
Guofeng Mei, Bin Ren, Juan Liu, Luigi Riz, Xiaoshui Huang, Xu Zheng, Yongshun Gong, Ming-Hsuan Yang, Nicu Sebe, Fabio Poiesi

TL;DR
This paper introduces S4Token, a scale-invariant 3D tokenizer for CLIP-based models. It improves cross-domain generalization in 3D scene understanding by combining superpoint-based grouping, coordinate scale normalization, and annotation-free self-supervised training.
Contribution
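The paper proposes S4Token, a universal, scale-invariant 3D tokenizer for a frozen CLIP backbone. It outperforms conventional k-nearest neighbor and radius-based tokenization, and it is trained without annotations using masked point modeling, clustering-based objectives, and cross-modal distillation.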
Findings
S4Token achieves better cross-domain generalization than conventional k-nearest neighbor and radius-based tokenization.
The tokenizer improves dense prediction accuracy with a superpoint-level feature propagation module.
Extensive experiments validate the effectiveness of the proposed approach.
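The superpoint-level feature propagation mentioned in the findings can be illustrated with a minimal sketch. This summary does not describe the module's internals, so the code below assumes the simplest possible scheme: point features are mean-pooled into their superpoint, and each superpoint's feature is then copied back to its member points. The function names `pool_to_superpoints` and `propagate_to_points`, and the toy data, are hypothetical and are not the authors' API.

```python
import numpy as np

def pool_to_superpoints(point_feats, labels, num_superpoints):
    """Mean-pool per-point features into one feature per superpoint (assumed scheme)."""
    sums = np.zeros((num_superpoints, point_feats.shape[1]))
    np.add.at(sums, labels, point_feats)           # accumulate features per superpoint
    counts = np.bincount(labels, minlength=num_superpoints)
    return sums / np.maximum(counts, 1)[:, None]   # guard against empty superpoints

def propagate_to_points(superpoint_feats, labels):
    """Give every point the feature of the superpoint it belongs to."""
    return superpoint_feats[labels]

# Toy example: 6 points grouped into 3 hypothetical superpoints.
labels = np.array([0, 0, 1, 2, 2, 2])
point_feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0],
                        [0.0, 1.0], [0.0, 2.0], [0.0, 3.0]])

sp_feats = pool_to_superpoints(point_feats, labels, num_superpoints=3)
dense_feats = propagate_to_points(sp_feats, labels)  # one feature row per point
```

Under this assumption, points in the same superpoint always receive identical features, which is what makes a superpoint-level prediction usable for dense, per-point output.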
Abstract
Vision-language models like CLIP can offer a promising foundation for 3D scene understanding when extended with 3D tokenizers. However, standard approaches, such as k-nearest neighbor or radius-based tokenization, struggle with cross-domain generalization due to sensitivity to dataset-specific spatial scales. We present a universal 3D tokenizer designed for scale-invariant representation learning with a frozen CLIP backbone. We show that combining superpoint-based grouping with coordinate scale normalization consistently outperforms conventional methods through extensive experimental analysis. Specifically, we introduce S4Token, a tokenization pipeline that produces semantically-informed tokens regardless of scene scale. Our tokenizer is trained without annotations using masked point modeling and clustering-based objectives, along with cross-modal distillation to align 3D tokens with 2D…
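The scale sensitivity of radius-based grouping described in the abstract can be shown with a small sketch. It is not the S4Token pipeline: it only assumes one common form of coordinate scale normalization (centering, then dividing by the largest distance from the centroid) and compares radius neighborhoods for the same synthetic scene expressed in metres and in millimetres. The helper names, the radii, and the random scene are all illustrative assumptions.

```python
import numpy as np

def normalize_scale(points):
    """Center the cloud and scale it so the farthest point lies at distance 1 (assumed scheme)."""
    centered = points - points.mean(axis=0, keepdims=True)
    max_norm = np.linalg.norm(centered, axis=1).max()
    return centered / max(max_norm, 1e-12)

def radius_neighbor_counts(points, radius):
    """Count, for each point, how many other points lie within `radius`."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return (dists <= radius).sum(axis=1) - 1  # subtract 1 to exclude the point itself

rng = np.random.default_rng(0)
scene_m = rng.uniform(0.0, 5.0, size=(200, 3))  # hypothetical scene, units: metres
scene_mm = scene_m * 1000.0                     # same scene, units: millimetres

# A fixed radius groups points very differently depending on the unit.
raw_m = radius_neighbor_counts(scene_m, radius=0.5)
raw_mm = radius_neighbor_counts(scene_mm, radius=0.5)

# After normalization the same radius yields the same neighborhoods.
norm_m = radius_neighbor_counts(normalize_scale(scene_m), radius=0.2)
norm_mm = radius_neighbor_counts(normalize_scale(scene_mm), radius=0.2)
```

With raw coordinates, the 0.5 radius finds neighbors in the metre-scale scene but none in the millimetre-scale copy, while the normalized clouds produce matching neighbor counts. This is the kind of dataset-specific scale dependence that a scale-invariant tokenizer is meant to remove.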
Taxonomy
Methods: Contrastive Language-Image Pre-training · ALIGN
