ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images

Xiaoshuai Zhang; Zhicheng Wang; Howard Zhou; Soham Ghosh; Danushen Gnanapragasam; Varun Jampani; Hao Su; Leonidas Guibas

arXiv:2408.17027·cs.CV·September 2, 2024

Open Access

TL;DR

ConDense introduces a novel 2D-3D joint training framework that leverages multi-view images and pre-trained 2D networks to create consistent, dense and sparse 3D features, improving various 3D understanding tasks.

Contribution

It proposes a new end-to-end training scheme for 2D and 3D feature consistency using volume rendering, enabling better 3D pre-training from 2D models and multi-view data.
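
The consistency idea can be illustrated with a minimal sketch, assuming the standard NeRF-style quadrature for compositing samples along a ray. This is not the paper's implementation: the function names `render_ray_features` and `consistency_loss`, the NumPy setting, and the mean-squared-error penalty are all assumptions chosen for illustration. Per-sample 3D features are alpha-composited into a single per-pixel feature, which is then compared with the feature a 2D network produced for that pixel.

```python
import numpy as np

def render_ray_features(sigmas, feats, deltas):
    """Composite per-sample 3D features along one ray (NeRF-style quadrature).

    sigmas: (S,) non-negative densities at the S samples (hypothetical input)
    feats:  (S, C) feature vectors predicted by a 3D network at the samples
    deltas: (S,) distances between consecutive samples
    """
    # Opacity contributed by each sample.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: fraction of the ray that reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    weights = trans * alphas
    # Weighted sum of sample features gives the rendered pixel feature.
    return weights @ feats, weights

def consistency_loss(rendered, target_2d):
    """Assumed penalty: mean squared error between rendered and 2D features."""
    return float(np.mean((rendered - target_2d) ** 2))

# Toy ray: an effectively opaque surface at the third sample.
sigmas = np.array([0.0, 0.0, 1e6, 5.0])
feats = np.arange(12, dtype=float).reshape(4, 3)
deltas = np.full(4, 0.1)
rendered, weights = render_ray_features(sigmas, feats, deltas)
loss = consistency_loss(rendered, feats[2])
```

In this toy ray the opaque sample absorbs all of the compositing weight, so the rendered feature coincides with that sample's feature and the loss against it vanishes. In a training setup the loss would be back-propagated into both the 2D and 3D branches, which would require an autodiff framework rather than plain NumPy.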

Findings

01

Outperforms existing 3D pre-training methods in classification and segmentation.

02

Enables efficient 3D scene matching and retrieval without fine-tuning.

03

Produces more consistent and less noisy 2D features for 3D tasks.
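
Finding 02 can be pictured with a small retrieval sketch, assuming each scene is summarized by a single pooled embedding and that matching uses cosine similarity. Both choices are assumptions made here for illustration; the summary above does not say how the paper pools features or scores matches, and `cosine_retrieve` is a hypothetical helper.

```python
import numpy as np

def cosine_retrieve(query, gallery, k=1):
    """Return indices and scores of the k gallery embeddings closest to query.

    query:   (C,) embedding of the query scene (hypothetical pooled feature)
    gallery: (N, C) embeddings of the candidate scenes
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q                       # cosine similarity to every candidate
    order = np.argsort(-scores)[:k]      # highest similarity first
    return order, scores[order]

gallery = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([2.0, 2.1])
top_idx, top_scores = cosine_retrieve(query, gallery, k=2)
```

Because the embeddings are only compared, not updated, a lookup of this kind needs no fine-tuning step.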

Abstract

To advance the state of the art in the creation of 3D foundation models, this paper introduces the ConDense framework for 3D pre-training utilizing existing pre-trained 2D networks and large-scale multi-view datasets. We propose a novel 2D-3D joint training scheme to extract co-embedded 2D and 3D features in an end-to-end pipeline, where 2D-3D feature consistency is enforced through a NeRF-like volume rendering (ray marching) process. Using dense per-pixel features, we are able to 1) directly distill the learned priors from 2D models to 3D models and create useful 3D backbones, 2) extract more consistent and less noisy 2D features, and 3) formulate a consistent embedding space where 2D, 3D, and other modalities of data (e.g., natural language prompts) can be jointly queried. Furthermore, besides dense features, ConDense can be trained to extract sparse features (e.g., key points), also with…

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization