TeDA: Boosting Vision-Language Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment
Zhichuan Wang, Yang Zhou, Jinhai Xiang, Yulong Wang, Xinwei He

TL;DR
This paper introduces TeDA, a novel test-time adaptation framework that enhances zero-shot 3D object retrieval by aligning 3D features with pre-trained vision-language models like CLIP, incorporating multi-view projections and textual descriptions.
Contribution
According to the authors, TeDA is the first approach to adapt a pretrained vision-language model for 3D object retrieval at test time, improving zero-shot performance without extensive training.
Findings
TeDA outperforms state-of-the-art methods on four 3D retrieval benchmarks.
Incorporating textual descriptions further boosts retrieval accuracy.
TeDA remains effective even with limited training data and when using depth maps.
Abstract
Learning discriminative 3D representations that generalize well to unknown testing categories is an emerging requirement for many real-world 3D applications. Existing well-established methods often struggle to attain this goal due to insufficient 3D training data from broader concepts. Meanwhile, pre-trained large vision-language models (e.g., CLIP) have shown remarkable zero-shot generalization capabilities. Yet, they are limited in extracting suitable 3D representations due to substantial gaps between their 2D training and 3D testing distributions. To address these challenges, we propose Testing-time Distribution Alignment (TeDA), a novel framework that adapts a pretrained 2D vision-language model CLIP for unknown 3D object retrieval at test time. To our knowledge, it is the first work that studies the test-time adaptation of a vision-language model for 3D feature learning. TeDA…
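To make the retrieval setting concrete, here is a minimal sketch of a generic multi-view retrieval baseline: each 3D object is represented by several per-view embeddings, which are averaged, L2-normalized, and ranked by cosine similarity. This is not TeDA's adaptation procedure, which the summary above does not specify. The function names (`pool_views`, `rank_gallery`, `fake_views`) are hypothetical, and the random vectors stand in for the image embeddings a model such as CLIP would produce from rendered views.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def pool_views(view_features):
    """Average per-view embeddings into one object descriptor, then normalize."""
    dim = len(view_features[0])
    count = len(view_features)
    mean = [sum(view[i] for view in view_features) / count for i in range(dim)]
    return l2_normalize(mean)


def rank_gallery(query, gallery):
    """Return gallery indices sorted by descending cosine similarity.

    All descriptors are unit length, so the dot product is the cosine.
    """
    scores = [sum(q * g for q, g in zip(query, desc)) for desc in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


def fake_views(center, rng, n_views=6, noise=0.05):
    """Assumption: stand-in for per-view image embeddings of one object."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(n_views)]


rng = random.Random(0)

# Three toy "categories", each with its own direction in embedding space.
centers = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

gallery = [pool_views(fake_views(c, rng)) for c in centers]
query = pool_views(fake_views(centers[1], rng))  # a new object from category 1

ranking = rank_gallery(query, gallery)
print("ranked gallery indices:", ranking)
```

In this toy setup the query is drawn from the second category, so the gallery object at index 1 ranks first. A test-time adaptation method would modify how the descriptors are computed before this ranking step.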
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Image Retrieval and Classification Techniques
Methods: Contrastive Language-Image Pre-training
