DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval

Xinwei He; Yansong Zheng; Qianru Han; Zhichuan Wang; Yuxuan Cai; Yang Zhou; Jingbo Xia; Yulong Wang; Jinhai Xiang; Xiang Bai

arXiv:2604.19432·cs.CV·April 22, 2026

TL;DR

This paper introduces DINO Eats CLIP (DEC), a framework that combines the DINO and CLIP encoders with dynamic multi-view integration and virtual feature synthesis to improve open-set 3D object retrieval (3DOR).

Contribution

The paper proposes a method that uses DINO for finer-grained features and synthesizes virtual features for unseen classes to improve open-set discrimination in 3D retrieval.

Findings

01. DEC outperforms previous methods on standard benchmarks.

02. The Chunking and Adapting Module (CAM) improves the robustness of multi-view features.

03. Virtual Feature Synthesis significantly enhances open-set recognition.

Abstract

Vision foundation models have shown great promise for open-set 3D object retrieval (3DOR) through efficient adaptation to multi-view images. Leveraging its semantically aligned latent space, previous work typically adapts the CLIP encoder to build view-based 3D descriptors. Despite CLIP's strong generalization ability, its lack of fine-grained detail prompted us to explore the potential of a more recent self-supervised encoder, DINO. To this end, we propose DINO Eats CLIP (DEC), a novel framework for dynamic multi-view integration that is regularized by synthesizing data for unseen classes. We first find that simply mean-pooling over view features from a frozen DINO backbone gives decent performance. Yet further adaptation causes severe overfitting to the average view patterns of known classes. To combat this, we then design a module named Chunking and Adapting Module (CAM). It segments…
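The mean-pooling baseline mentioned in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration: the toy 3-dimensional vectors stand in for per-view features from a frozen encoder, the helper names (`mean_pool`, `cosine_similarity`, `retrieve`) are hypothetical, and cosine similarity is assumed as the ranking metric because the abstract does not specify one. It is not the authors' implementation.

```python
import math

def mean_pool(view_features):
    """Average per-view feature vectors into a single object descriptor."""
    n_views = len(view_features)
    dim = len(view_features[0])
    return [sum(v[d] for v in view_features) / n_views for d in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_similarity(a, b):
    """Cosine similarity between two vectors (assumed metric, not from the paper)."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_views, gallery):
    """Rank gallery object ids by similarity to the mean-pooled query descriptor."""
    query = mean_pool(query_views)
    scores = {
        obj_id: cosine_similarity(query, mean_pool(views))
        for obj_id, views in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy stand-ins for frozen-encoder view features: two views per object.
gallery = {
    "chair": [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]],
    "lamp": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
query_views = [[0.8, 0.2, 0.0], [1.0, 0.0, 0.2]]

ranking = retrieve(query_views, gallery)
print(ranking)
```

Because every view contributes equally to the descriptor, this baseline has no learned parameters; the abstract reports that adapting beyond it tends to overfit to the average view patterns of known classes.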
