TL;DR
Duoduo CLIP introduces a multi-view image-based 3D representation learning model that leverages 2D priors, reduces computational costs, and improves generalization and fine-grained retrieval performance.
Contribution
The paper presents a novel multi-view image approach for 3D understanding that is more efficient, flexible, and better aligned with text than existing point cloud methods.
Findings
Outperforms point cloud methods in generalization and efficiency.
Requires significantly less training time and computational resources.
Achieves superior performance in fine-grained text-to-shape retrieval.
Abstract
We introduce Duoduo CLIP, a model for 3D representation learning that learns shape encodings from multi-view images instead of point clouds. The choice of multi-view images allows us to leverage 2D priors from off-the-shelf CLIP models to facilitate fine-tuning with 3D data. Our approach not only shows better generalization compared to existing point cloud methods, but also reduces GPU requirements and training time. In addition, the model is modified with cross-view attention to leverage information across multiple frames of the object, which further boosts performance. Notably, our model is permutation invariant to the order of multi-view images while being pose-free. Compared to the current SOTA point cloud method, which requires 480 A100 hours to train 1 billion model parameters, we only require 57 A5000 hours and 87 million parameters. Multi-view images also provide more flexibility…
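The permutation-invariance claim in the abstract can be illustrated with a small sketch. This is not the authors' architecture: it assumes plain dot-product attention across per-view embeddings with no learned projections and no positional encoding, followed by mean pooling, and the function name `attention_pool` is hypothetical. Because no step depends on view position, reordering the views leaves the pooled shape embedding unchanged up to floating-point rounding.

```python
import math
import random


def attention_pool(views):
    """Attend across views, then mean-pool into one shape embedding.

    `views` is a list of equal-length float lists, one embedding per rendered
    view. Assumption: identity query/key/value projections and no positional
    encoding; a real model would use learned weights.
    """
    dim = len(views[0])
    scale = math.sqrt(dim)
    attended = []
    for query in views:
        # Scaled dot-product scores of this view against every view.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in views]
        # Numerically stable softmax over the views.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted sum of view embeddings for this query view.
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, views)) / total for i in range(dim)]
        )
    # Mean over views removes any dependence on view order.
    count = len(attended)
    return [sum(row[i] for row in attended) / count for i in range(dim)]


random.seed(0)
views = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
shuffled = views[:]
random.shuffle(shuffled)

original = attention_pool(views)
reordered = attention_pool(shuffled)
same = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9) for a, b in zip(original, reordered))
print("order-invariant:", same)
```

The same function accepts any number of views, which mirrors the flexibility the abstract attributes to multi-view input; the view count and embedding size used here are arbitrary illustration values.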
Peer Reviews
Decision: ICLR 2025 Poster
1. The paper is clearly written and easy to follow. Extensive experiments are conducted for point cloud classification and text-based retrieval. 2. The proposed method achieves state-of-the-art performance on these tasks while significantly reducing computational costs.
1. The zero-shot CLIP baseline is insufficient to demonstrate the proposed model’s advantages, as the test set is specific to 3D renderings, while the original CLIP model is primarily trained on natural images. To provide a fair comparison, the authors should fine-tune the original CLIP model using the same datasets and training strategies as the proposed model, and report the results. 2. Representing 3D shapes with multi-view images is one of the most straightforward approaches and has been us
- The paper is well-written. The figures are well-made and easy to understand. The overall presentation is good. - The idea is simple and easy to understand. The architecture proposed might be reused for other multiview image tasks. - The quantitative results are pretty strong and convincing.
- It's unclear to me why this task is important. The proposed method seems pretty incremental and I don't see any clear insights or surprising findings revealed by the paper
### Originality - It is reasonable to use pre-trained CLIP encoders and fine-tune them for 3D shape understanding from multiple views. This allows for training on more data and taking advantage of the large-scale pre-trained representation. - Attention implemented across multiple views of a single object is reasonable, and ablations show that it helps. ### Quality - The investigation into the effect of replacing the point-cloud encoder with a multi-view image encoder is thorough, including analy
### Originality - The main technical novelty is replacing a point cloud encoder with a multi-view CLIP image encoder, and tuning this multi-view encoder. While effective, the novelty is minor. It would also be good to understand the significance of what pre-trained representation (CLIP vs. DINO for example) is used to initialize the multi-view encoder ### Quality - Accuracy of evaluation: Both the Objaverse LVIS and the new MVPNet use somewhat automated procedures using off-the-shelf VLMs to cre
Taxonomy
Topics: Colorectal Cancer Screening and Detection
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training
