BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining

Ajinkya Khoche; Gergő László Nagy; Maciej Wozniak; Thomas Gustafsson; Patric Jensfelt

arXiv:2510.18244·cs.CV·October 22, 2025

Open Access

TL;DR

BlendCLIP introduces a multimodal pretraining framework that effectively bridges the synthetic-to-real domain gap in zero-shot 3D object classification, significantly improving outdoor scene recognition with minimal real-world data.

Contribution

It proposes a curriculum-based data mixing strategy and a large-scale dataset of multimodal triplets to enhance domain adaptation in zero-shot 3D classification.

Findings

01 Boosts zero-shot accuracy on nuScenes by 27% with minimal real data

02 Achieves 19.3% improvement over prior methods on nuScenes

03 Maintains strong generalization across synthetic and real datasets

Abstract

Zero-shot 3D object classification is crucial for real-world applications like autonomous driving; however, it is often hindered by a significant domain gap between the synthetic data used for training and the sparse, noisy LiDAR scans encountered in the real world. Current methods trained solely on synthetic data fail to generalize to outdoor scenes, while those trained only on real data lack the semantic diversity to recognize rare or unseen objects. We introduce BlendCLIP, a multimodal pretraining framework that bridges this synthetic-to-real gap by strategically combining the strengths of both domains. We first propose a pipeline to generate a large-scale dataset of object-level triplets (each consisting of a point cloud, an image, and a text description) mined directly from real-world driving data and human-annotated 3D boxes. Our core contribution is a curriculum-based data mixing…
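The abstract is cut off before it describes the curriculum, so the following is only a minimal sketch of what curriculum-based data mixing could look like in general, not the authors' method. The function names (`real_fraction`, `mixed_batch`), the linear schedule, its direction, and the default fractions are all assumptions made for illustration.

```python
import random


def real_fraction(step, total_steps, start=0.0, end=0.5):
    """Hypothetical schedule: linearly ramp the share of real samples per batch.

    The linear shape and the start/end values are illustrative assumptions,
    not values taken from the paper.
    """
    if total_steps <= 1:
        return end
    # Normalised training progress, clamped to [0, 1].
    t = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + t * (end - start)


def mixed_batch(synthetic, real, batch_size, frac_real, rng):
    """Draw one batch containing roughly `frac_real` real-domain samples.

    `synthetic` and `real` are lists of (point cloud, image, text) triplets
    or any stand-in objects; sampling is with replacement.
    """
    n_real = min(round(batch_size * frac_real), batch_size)
    n_syn = batch_size - n_real
    batch = rng.choices(real, k=n_real) + rng.choices(synthetic, k=n_syn)
    rng.shuffle(batch)  # avoid a fixed real/synthetic ordering inside the batch
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder triplets: (domain tag, index) stands in for real data.
    synthetic = [("synthetic", i) for i in range(1000)]
    real = [("real", i) for i in range(50)]
    total_steps = 5
    for step in range(total_steps):
        frac = real_fraction(step, total_steps)
        batch = mixed_batch(synthetic, real, batch_size=8, frac_real=frac, rng=rng)
        n_real = sum(1 for domain, _ in batch if domain == "real")
        print(f"step {step}: target real fraction {frac:.3f}, real samples {n_real}/8")
```

In a sketch like this, the schedule is kept separate from the sampler so that a different curriculum (for example a step function or a decreasing ramp) can be swapped in without touching the batching code.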

Taxonomy

Topics: Advanced Neural Network Applications · 3D Shape Modeling and Analysis · Domain Adaptation and Few-Shot Learning