ET3D: Efficient Text-to-3D Generation via Multi-View Distillation
Yiming Chen, Zhiqi Li, Peidong Liu

TL;DR
ET3D introduces a fast, efficient text-to-3D generation method that leverages pre-trained text-to-image models and a 3D GAN, enabling rapid 3D asset creation without 3D training data.
Contribution
The paper presents a novel approach that distills pre-trained text-to-image diffusion models into a 3D GAN for real-time text-to-3D generation.
Findings
Generation time is approximately 8 ms per 3D asset.
No 3D training data is required for the method.
Generation is fast enough for real-time applications.
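To put the reported 8 ms figure in perspective, the short sketch below converts per-asset latency into throughput and a speedup ratio. The 8 ms value comes from the findings above; the baseline times of 3 and 30 minutes are assumed sample points chosen only to illustrate "several minutes to tens of minutes", and the function names are hypothetical.

```python
# Illustrative arithmetic only, not a benchmark.
# 8 ms is the per-asset time reported in the summary above.
ET3D_SECONDS_PER_ASSET = 0.008

# Assumed sample points for per-asset optimization methods
# ("several minutes to tens of minutes"); not measured values.
ASSUMED_BASELINES_SECONDS = {"3 minutes": 3 * 60, "30 minutes": 30 * 60}


def assets_per_second(seconds_per_asset: float) -> float:
    """Throughput implied by a fixed per-asset latency."""
    return 1.0 / seconds_per_asset


def speedup(baseline_seconds: float,
            fast_seconds: float = ET3D_SECONDS_PER_ASSET) -> float:
    """How many times faster the fast method is than the baseline."""
    return baseline_seconds / fast_seconds


if __name__ == "__main__":
    print(f"throughput: {assets_per_second(ET3D_SECONDS_PER_ASSET):.0f} assets/s")
    for label, seconds in ASSUMED_BASELINES_SECONDS.items():
        print(f"speedup vs. {label}: {speedup(seconds):,.0f}x")
```

Under these assumptions, 8 ms per asset corresponds to roughly 125 assets per second, and the speedup over per-asset optimization falls somewhere between four and five orders of magnitude.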
Abstract
Recent breakthroughs in text-to-image generation have shown encouraging results via large generative models. Due to the scarcity of 3D assets, it is hard to transfer the success of text-to-image generation to text-to-3D generation. Existing text-to-3D generation methods usually adopt the paradigm of DreamFusion, which conducts per-asset optimization by distilling a pretrained text-to-image diffusion model. The generation speed usually ranges from several minutes to tens of minutes per 3D asset, which degrades the user experience and also imposes a burden on service providers due to the high computational budget. In this work, we present an efficient text-to-3D generation method, which requires only around 8 ms to generate a 3D asset given the text prompt on a consumer graphics card. The main insight is that we exploit the images generated by a large pre-trained…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Image Processing and 3D Reconstruction
Methods: travel james · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion
