Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval

Zhichuan Wang; Yang Zhou; Zhe Liu; Rui Yu; Song Bai; Yulong Wang; Xinwei He; Xiang Bai

arXiv:2507.21489·cs.CV·July 30, 2025

TL;DR

This paper introduces DAC, a framework that enhances CLIP's capabilities for open-set 3D object retrieval by integrating a multi-modal large language model and a novel adaptation method, significantly improving generalization to unseen categories.

Contribution

The paper proposes a novel framework called Describe, Adapt and Combine (DAC) that synergizes CLIP with a multi-modal large language model for open-set 3D object retrieval, introducing AB-LoRA for better generalization.

Findings

01. DAC outperforms prior methods by +10.01% mAP on four datasets.

02. The framework generalizes well across image-based and cross-dataset scenarios.

03. AB-LoRA effectively alleviates overfitting and enhances unseen category recognition.

Abstract

Open-set 3D object retrieval (3DOR) is an emerging task aiming to retrieve 3D objects of unseen categories beyond the training set. Existing methods typically utilize all modalities (i.e., voxels, point clouds, multi-view images) and train specific backbones before fusion. However, they still struggle to produce generalized representations due to insufficient 3D training data. Being contrastively pre-trained on web-scale image-text pairs, CLIP inherently produces generalized representations for a wide range of downstream tasks. Building upon it, we present a simple yet effective framework named Describe, Adapt and Combine (DAC) by taking only multi-view images for open-set 3DOR. DAC innovatively synergizes a CLIP model with a multi-modal large language model (MLLM) to learn generalized 3D representations, where the MLLM is used for dual purposes. First, it describes the seen category…
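The multi-view retrieval setting described in the abstract can be illustrated with a minimal sketch. It is not the DAC implementation: the tiny hand-written vectors stand in for per-view CLIP image embeddings, mean pooling over views is an assumed aggregation choice, and the helper names (`pool_views`, `rank_gallery`, `average_precision`) are hypothetical. The sketch only shows how one descriptor per object can be ranked by cosine similarity and scored with average precision, the per-query quantity that mAP averages.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def pool_views(view_embeddings):
    """Average per-view embeddings into one object descriptor, then normalize.

    Mean pooling is an assumption made for illustration only.
    """
    n = len(view_embeddings)
    dim = len(view_embeddings[0])
    mean = [sum(v[d] for v in view_embeddings) / n for d in range(dim)]
    return l2_normalize(mean)


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def rank_gallery(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [(cosine(query, g), idx) for idx, g in enumerate(gallery)]
    return [idx for _, idx in sorted(scores, key=lambda s: -s[0])]


def average_precision(ranked_labels, query_label):
    """Average precision for one query; mAP is the mean of this over queries."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy 2-D "embeddings": two rendered views per object (stand-ins, not CLIP outputs).
gallery_views = [
    [[1.0, 0.1], [0.9, 0.0]],  # object 0, label "chair"
    [[0.0, 1.0], [0.1, 0.9]],  # object 1, label "table"
    [[0.8, 0.2], [1.0, 0.2]],  # object 2, label "chair"
]
gallery_labels = ["chair", "table", "chair"]
query_views = [[1.0, 0.0], [0.9, 0.1]]  # query object, label "chair"

gallery = [pool_views(views) for views in gallery_views]
query = pool_views(query_views)

ranking = rank_gallery(query, gallery)
ap = average_precision([gallery_labels[i] for i in ranking], "chair")
print(ranking)  # [0, 2, 1]
print(ap)       # 1.0
```

In an open-set evaluation the query and gallery categories would be ones never seen during training, so the ranking quality depends entirely on how well the embeddings generalize.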
