Self-Supervised and Generalizable Tokenization for CLIP-Based 3D Understanding

Guofeng Mei; Bin Ren; Juan Liu; Luigi Riz; Xiaoshui Huang; Xu Zheng; Yongshun Gong; Ming-Hsuan Yang; Nicu Sebe; Fabio Poiesi

arXiv:2505.18819 · cs.CV · May 27, 2025

TL;DR

This paper introduces S4Token, a scale-invariant 3D tokenizer for CLIP-based models that improves cross-domain generalization in 3D scene understanding by combining superpoint grouping, normalization, and self-supervised training.

Contribution

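The paper proposes S4Token, a universal, scale-invariant 3D tokenizer for a frozen CLIP backbone. It outperforms conventional k-nearest neighbor and radius-based tokenization and is trained without annotations, using masked point modeling, clustering-based objectives, and cross-modal distillation.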

Findings

01. S4Token achieves better cross-domain generalization than conventional methods.

02. The tokenizer improves dense prediction accuracy with a superpoint-level feature propagation module.

03. Extensive experiments validate the effectiveness of the proposed approach.

Abstract

Vision-language models like CLIP can offer a promising foundation for 3D scene understanding when extended with 3D tokenizers. However, standard approaches, such as k-nearest neighbor or radius-based tokenization, struggle with cross-domain generalization due to sensitivity to dataset-specific spatial scales. We present a universal 3D tokenizer designed for scale-invariant representation learning with a frozen CLIP backbone. We show that combining superpoint-based grouping with coordinate scale normalization consistently outperforms conventional methods through extensive experimental analysis. Specifically, we introduce S4Token, a tokenization pipeline that produces semantically-informed tokens regardless of scene scale. Our tokenizer is trained without annotations using masked point modeling and clustering-based objectives, along with cross-modal distillation to align 3D tokens with 2D…
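The scale sensitivity described in the abstract can be illustrated with a small sketch. This is not the S4Token pipeline. It only shows, under stated assumptions, why a fixed-radius grouping depends on the units of the input, and how a simple coordinate scale normalization (centering and rescaling to the unit sphere, chosen here for illustration) removes that dependence. The function names `normalize_scale` and `radius_neighbors`, the synthetic scene, and the radii are all hypothetical.

```python
import math
import random


def normalize_scale(points):
    """Center a point cloud and rescale it so the farthest point lies at distance 1.

    Assumption: unit-sphere normalization stands in for the paper's
    coordinate scale normalization, whose exact form is not given here.
    """
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - centroid[i] for i in range(3)] for p in points]
    max_norm = max(math.sqrt(sum(c * c for c in p)) for p in centered)
    if max_norm == 0.0:
        return centered  # degenerate cloud: all points coincide
    return [[c / max_norm for c in p] for p in centered]


def radius_neighbors(points, center_idx, radius):
    """Count points within `radius` of points[center_idx], including itself."""
    center = points[center_idx]
    return sum(1 for p in points if math.dist(p, center) <= radius)


# Synthetic scene: 500 random points in a 5 m cube (illustrative data only).
random.seed(0)
scene_m = [[random.uniform(0.0, 5.0) for _ in range(3)] for _ in range(500)]
# The same scene expressed in millimetres.
scene_mm = [[c * 1000.0 for c in p] for p in scene_m]

# A fixed radius of 1.0 means 1 m in one cloud and 1 mm in the other,
# so the groups built around the same point differ.
raw_counts = (
    radius_neighbors(scene_m, 0, 1.0),
    radius_neighbors(scene_mm, 0, 1.0),
)

# After normalization both clouds share one coordinate frame,
# so the same radius selects the same neighbors.
norm_counts = (
    radius_neighbors(normalize_scale(scene_m), 0, 0.3),
    radius_neighbors(normalize_scale(scene_mm), 0, 0.3),
)

print("raw:", raw_counts)
print("normalized:", norm_counts)
```

In this sketch the raw counts depend on the unit of measurement, while the normalized counts agree for both copies of the scene. The paper pairs normalization with superpoint-based grouping, which this example does not attempt to reproduce.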

Taxonomy

Methods: Contrastive Language-Image Pre-training · ALIGN