ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images
Xiaoshuai Zhang, Zhicheng Wang, Howard Zhou, Soham Ghosh, Danushen Gnanapragasam, Varun Jampani, Hao Su, Leonidas Guibas

TL;DR
ConDense introduces a 2D-3D joint training framework that uses multi-view images and pre-trained 2D networks to learn consistent dense and sparse 3D features, improving a range of 3D understanding tasks.
Contribution
It proposes a new end-to-end training scheme for 2D and 3D feature consistency using volume rendering, enabling better 3D pre-training from 2D models and multi-view data.
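As a rough illustration of how volume rendering can tie 3D features to 2D ones, the sketch below composites per-sample features along a single ray using the standard NeRF-style quadrature weights, then compares the result with the 2D feature at the corresponding pixel. The function names, the toy inputs, and the choice of mean squared error as the consistency loss are assumptions made for illustration; this summary does not specify the paper's actual loss or architecture.

```python
import math


def render_feature(densities, deltas, features):
    """Composite per-sample 3D features along one ray with NeRF-style weights.

    densities: volume density at each sample along the ray
    deltas:    distance between consecutive samples
    features:  one feature vector per sample (hypothetical 3D backbone output)
    """
    dim = len(features[0])
    rendered = [0.0] * dim
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for sigma, delta, feat in zip(densities, deltas, features):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for k in range(dim):
            rendered[k] += weight * feat[k]
        transmittance *= 1.0 - alpha
    return rendered


def consistency_loss(rendered, pixel_feature):
    """Mean squared error between rendered and 2D features (assumed loss)."""
    return sum((r - p) ** 2 for r, p in zip(rendered, pixel_feature)) / len(rendered)


# Toy ray: empty space, then a dense surface, then a thin medium behind it.
densities = [0.0, 50.0, 1.0]
deltas = [0.1, 0.1, 0.1]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

rendered = render_feature(densities, deltas, features)
loss = consistency_loss(rendered, [0.0, 1.0])  # 2D feature at the pixel
```

In this toy case the dense second sample dominates the rendered feature, so the loss against a matching 2D feature is small. In a trainable setting the same weights would let gradients from the 2D comparison flow back into the 3D features.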
Findings
Outperforms existing 3D pre-training methods in classification and segmentation.
Enables efficient 3D scene matching and retrieval without fine-tuning.
Produces more consistent and less noisy 2D features for 3D tasks.
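Retrieval without fine-tuning is commonly done by nearest-neighbor search over fixed embeddings. The sketch below ranks a small gallery of scene embeddings by cosine similarity to a query embedding. The embeddings, scene names, and the use of cosine similarity are all hypothetical; this summary does not state which similarity measure or index ConDense uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    return sorted(
        gallery,
        key=lambda name: cosine_similarity(query, gallery[name]),
        reverse=True,
    )


# Hypothetical scene-level embeddings produced by a frozen encoder.
gallery = {
    "kitchen": [0.9, 0.1, 0.0],
    "office": [0.1, 0.9, 0.2],
    "garden": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

ranking = retrieve(query, gallery)
```

Because the encoder stays frozen, adding a new scene only requires computing its embedding once and inserting it into the gallery.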
Abstract
To advance the state of the art in the creation of 3D foundation models, this paper introduces the ConDense framework for 3D pre-training utilizing existing pre-trained 2D networks and large-scale multi-view datasets. We propose a novel 2D-3D joint training scheme to extract co-embedded 2D and 3D features in an end-to-end pipeline, where 2D-3D feature consistency is enforced through a volume rendering NeRF-like ray marching process. Using dense per-pixel features, we are able to 1) directly distill the learned priors from 2D models to 3D models and create useful 3D backbones, 2) extract more consistent and less noisy 2D features, and 3) formulate a consistent embedding space where 2D, 3D, and other modalities of data (e.g., natural language prompts) can be jointly queried. Furthermore, besides dense features, ConDense can be trained to extract sparse features (e.g., key points), also with…
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization
