TeDA: Boosting Vision-Lanuage Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment

Zhichuan Wang; Yang Zhou; Jinhai Xiang; Yulong Wang; Xinwei He

arXiv:2505.02325 · cs.CV · May 6, 2025

TL;DR

This paper introduces TeDA, a novel test-time adaptation framework that enhances zero-shot 3D object retrieval by aligning 3D features with pre-trained vision-language models like CLIP, incorporating multi-view projections and textual descriptions.

Contribution

TeDA is the first approach to adapt a pretrained vision-language model for 3D object retrieval at test time, significantly improving zero-shot performance without extensive training.

Findings

01. TeDA outperforms state-of-the-art methods on four 3D retrieval benchmarks.

02. Incorporating textual descriptions further boosts retrieval accuracy.

03. TeDA remains effective even with limited training data and when using depth maps.

Abstract

Learning discriminative 3D representations that generalize well to unknown testing categories is an emerging requirement for many real-world 3D applications. Existing well-established methods often struggle to attain this goal due to insufficient 3D training data from broader concepts. Meanwhile, pre-trained large vision-language models (e.g., CLIP) have shown remarkable zero-shot generalization capabilities. Yet, they are limited in extracting suitable 3D representations due to substantial gaps between their 2D training and 3D testing distributions. To address these challenges, we propose Testing-time Distribution Alignment (TeDA), a novel framework that adapts a pretrained 2D vision-language model CLIP for unknown 3D object retrieval at test time. To our knowledge, it is the first work that studies the test-time adaptation of a vision-language model for 3D feature learning. TeDA…
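The retrieval setting described in the abstract can be illustrated with a minimal sketch. This is not the TeDA method, whose details are cut off above; it only shows the generic baseline of pooling per-view image embeddings of a 3D object into one descriptor and ranking a gallery by cosine similarity. The function names (`pool_views`, `rank_gallery`) and the toy 2-dimensional vectors are assumptions for illustration; in practice the per-view embeddings would come from an image encoder such as CLIP's.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def pool_views(view_embeddings):
    """Hypothetical helper: normalize each view embedding, average them,
    and normalize the mean to obtain one descriptor per 3D object."""
    normalized = [l2_normalize(v) for v in view_embeddings]
    n = len(normalized)
    dim = len(normalized[0])
    mean = [sum(v[i] for v in normalized) / n for i in range(dim)]
    return l2_normalize(mean)


def rank_gallery(query_views, gallery):
    """Hypothetical helper: return gallery ids sorted by descending
    cosine similarity between pooled query and gallery descriptors."""
    q = pool_views(query_views)
    scores = {}
    for obj_id, views in gallery.items():
        g = pool_views(views)
        # Both descriptors are unit length, so the dot product is the cosine.
        scores[obj_id] = sum(a * b for a, b in zip(q, g))
    return sorted(scores, key=scores.get, reverse=True)


# Toy data standing in for encoder outputs (assumed, not from the paper).
gallery = {
    "chair": [[1.0, 0.0], [0.9, 0.1]],
    "table": [[0.0, 1.0], [0.1, 0.9]],
}
query_views = [[0.8, 0.2], [1.0, 0.0]]
ranking = rank_gallery(query_views, gallery)
```

Mean pooling over views is only one simple choice; it is used here because it needs no training, which matches the zero-shot setting, but it does nothing to close the 2D-to-3D distribution gap that the abstract identifies as the core problem.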

Code & Models

Repositories

wangzhichuan123/teda
PyTorch · Official

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Image Retrieval and Classification Techniques

Methods: Contrastive Language-Image Pre-training