Learning Robust 3D Representation from CLIP via Dual Denoising

Shuqing Luo; Bowen Qu; Wei Gao

arXiv:2407.00905 · cs.CV · July 2, 2024

TL;DR

This paper introduces Dual Denoising, a framework that enhances the robustness and generalization of 3D representations learned from CLIP, especially against adversarial attacks, without requiring adversarial training.

Contribution

It proposes a novel dual denoising framework that combines a denoising-based proxy task with a feature denoising network for robust 3D pre-training from CLIP.

Findings

01. Improves 3D representation performance in zero-shot settings

02. Enhances adversarial robustness without adversarial training

03. Effective in cross-domain point cloud generalization

Abstract

In this paper, we explore a critical yet under-investigated issue: how to learn robust and well-generalized 3D representations from pre-trained vision-language models such as CLIP. Previous works have demonstrated that cross-modal distillation can provide rich and useful knowledge for 3D data. However, like most deep learning models, the resulting 3D learning network remains vulnerable to adversarial attacks, especially iterative attacks. In this work, we propose Dual Denoising, a novel framework for learning robust and well-generalized 3D representations from CLIP. It combines a denoising-based proxy task with a novel feature denoising network for 3D pre-training. Additionally, we propose utilizing parallel noise inference to enhance the generalization of point cloud features under cross-domain settings. Experiments show that our model can effectively improve the representation…
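
The abstract does not spell out how parallel noise inference works. One plausible reading, offered only as an illustrative sketch and not as the authors' implementation, is to encode several independently perturbed copies of a point cloud, average the resulting features, and compare the average against class text embeddings by cosine similarity, as in CLIP-style zero-shot classification. Every name below (`make_projection`, `encode`, `parallel_noise_features`, `zero_shot_predict`, `sigma`, `num_copies`) is hypothetical, the Gaussian noise model is an assumption, and the toy encoder (a random linear map with max pooling) merely stands in for a real 3D network.

```python
import math
import random


def make_projection(dim_out, seed=0):
    # Hypothetical stand-in for learned encoder weights: a random 3 -> dim_out map.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(dim_out)]


def encode(points, weights):
    # Toy encoder: per-point linear projection followed by max pooling over points.
    return [max(sum(w * c for w, c in zip(row, p)) for p in points) for row in weights]


def add_noise(points, sigma, rng):
    # Assumed noise model: independent Gaussian jitter on every coordinate.
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in points]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def parallel_noise_features(points, weights, num_copies=8, sigma=0.02, seed=0):
    # Encode several noisy copies of the same cloud and average their features.
    rng = random.Random(seed)
    feats = [encode(add_noise(points, sigma, rng), weights) for _ in range(num_copies)]
    mean = [sum(col) / num_copies for col in zip(*feats)]
    return normalize(mean)


def zero_shot_predict(feature, text_embeddings):
    # Pick the class whose (normalized) text embedding has the highest cosine similarity.
    scores = {
        name: sum(a * b for a, b in zip(feature, normalize(emb)))
        for name, emb in text_embeddings.items()
    }
    return max(scores, key=scores.get), scores
```

In a real pipeline the text embeddings would come from CLIP's text encoder and the point cloud features from the pre-trained 3D network; here both are placeholders, so the sketch only shows the control flow of averaging over noisy copies before classification.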

Taxonomy

Topics: 3D Surveying and Cultural Heritage · Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging

Methods: Contrastive Language-Image Pre-training