DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval
Xinwei He, Yansong Zheng, Qianru Han, Zhichuan Wang, Yuxuan Cai, Yang Zhou, Jingbo Xia, Yulong Wang, Jinhai Xiang, Xiang Bai

TL;DR
This paper introduces DEC, a novel framework combining DINO and CLIP models with dynamic multi-view integration and virtual feature synthesis to improve open-set 3D object retrieval.
Contribution
It proposes a new method that leverages DINO for better fine-grained features and synthesizes virtual features to enhance open-set discrimination in 3D retrieval.
Findings
DEC outperforms previous methods on standard benchmarks.
Chunking and Adapting Module improves robustness of multi-view features.
Virtual Feature Synthesis significantly enhances open-set recognition.
Abstract
Vision foundation models have shown great promise for open-set 3D object retrieval (3DOR) through efficient adaptation to multi-view images. Leveraging its semantically aligned latent space, previous work typically adapts the CLIP encoder to build view-based 3D descriptors. Despite CLIP's strong generalization ability, its lack of fine-grainedness prompted us to explore the potential of a more recent self-supervised encoder, DINO. To this end, we propose DINO Eats CLIP (DEC), a novel framework for dynamic multi-view integration that is regularized by synthesizing data for unseen classes. We first find that simply mean-pooling over view features from a frozen DINO backbone gives decent performance. Yet further adaptation causes severe overfitting to the average view patterns of known classes. To combat this, we then design a module named Chunking and Adapting Module (CAM). It segments…
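The mean-pooling baseline mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature arrays are random placeholders standing in for per-view embeddings from a frozen encoder, and the function names, the feature dimension (384), the number of views (12), and the use of cosine similarity for ranking are all assumptions made for illustration.

```python
import numpy as np

def mean_pool_descriptor(view_feats):
    """Average per-view features (num_views, dim) into one L2-normalised 3D descriptor."""
    desc = view_feats.mean(axis=0)
    return desc / (np.linalg.norm(desc) + 1e-12)

def retrieve(query_desc, gallery_descs, top_k=3):
    """Rank gallery objects by cosine similarity to the query descriptor."""
    sims = gallery_descs @ query_desc          # descriptors are unit-norm
    order = np.argsort(-sims)[:top_k]          # highest similarity first
    return order, sims[order]

rng = np.random.default_rng(0)

# Placeholder data (assumption): random vectors instead of real encoder outputs.
num_objects, num_views, dim = 5, 12, 384
feats = rng.normal(size=(num_objects, num_views, dim))
gallery = np.stack([mean_pool_descriptor(f) for f in feats])

# Hypothetical query: a slightly perturbed copy of object 2's views.
query_views = feats[2] + 0.1 * rng.normal(size=(num_views, dim))
query = mean_pool_descriptor(query_views)

ranked, scores = retrieve(query, gallery)
print("ranking:", ranked)
print("similarities:", scores)
```

Because every view contributes equally to the average, a descriptor built this way has no mechanism for weighting informative views, which is the kind of limitation that a dynamic multi-view integration scheme is meant to address.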
