Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D
Mukund Varma T, Peihao Wang, Zhiwen Fan, Zhangyang Wang, Hao Su, Ravi Ramamoorthi

TL;DR
Lift3D is a zero-shot approach that extends 2D vision models to 3D, enabling consistent multi-view predictions across various tasks without task-specific training.
Contribution
The paper introduces Lift3D, a novel method that generalizes 2D vision models to 3D, achieving zero-shot multi-view consistency for diverse vision tasks.
Findings
Outperforms task-specific 3D methods in several tasks
Works with models like DINO and CLIP without retraining
Enables 3D predictions for style transfer, segmentation, and more
Abstract
In recent years, there has been an explosion of 2D vision models for numerous tasks such as semantic segmentation, style transfer or scene editing, enabled by large-scale 2D image datasets. At the same time, there has been renewed interest in 3D scene representations such as neural radiance fields from multi-view images. However, the availability of 3D or multiview data is still substantially limited compared to 2D image datasets, making extending 2D vision models to 3D data highly desirable but also very challenging. Indeed, extending a single 2D vision operator like scene editing to 3D typically requires a highly creative method specialized to that task and often requires per-scene optimization. In this paper, we ask the question of whether any 2D vision model can be lifted to make 3D consistent predictions. We answer this question in the affirmative; our new Lift3D method trains to…
Taxonomy
Topics: Advanced Vision and Imaging · Computer Graphics and Visualization Techniques · Image Enhancement Techniques
Methods: Attention Is All You Need · Linear Layer · Residual Connection · Layer Normalization · Softmax · Multi-Head Attention · Dense Connections · Vision Transformer · self-DIstillation with NO labels
