Dual Diffusion Models for Multi-modal Guided 3D Avatar Generation
Hong Li, Yutang Feng, Minqi Meng, Yichen Yang, Xuhui Liu, Baochang Zhang

TL;DR
This paper introduces PromptAvatar, a dual diffusion model framework that rapidly generates high-quality 3D avatars from text or image prompts, overcoming limitations of prior methods in semantic control and efficiency.
Contribution
The paper presents a large-scale multi-modal dataset and a novel dual diffusion model approach that directly maps prompts to 3D avatars, eliminating iterative optimization.
Findings
Outperforms state-of-the-art methods in quality and detail
Generates avatars in under 10 seconds
Achieves superior semantic alignment with the input prompt
Abstract
Generating high-fidelity 3D avatars from text or image prompts is highly sought after in virtual reality and human-computer interaction. However, existing text-driven methods often rely on iterative Score Distillation Sampling (SDS) or CLIP optimization, which struggle with fine-grained semantic control and suffer from excessively slow inference. Meanwhile, image-driven approaches are severely bottlenecked by the scarcity and high acquisition cost of high-quality 3D facial scans, limiting model generalization. To address these challenges, we first construct a novel, large-scale dataset comprising over 100,000 pairs across four modalities: fine-grained textual descriptions, in-the-wild face images, high-quality light-normalized texture UV maps, and 3D geometric shapes. Leveraging this comprehensive dataset, we propose PromptAvatar, a framework featuring dual diffusion models.…
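As a rough illustration of the four-modality pairing described in the abstract, one dataset record could be modeled as below. This is a minimal sketch and not the authors' data format: the class name `AvatarSample`, its field names, the file paths, and the choice to store the shape as a list of vertex positions are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AvatarSample:
    """Hypothetical record pairing the four modalities named in the abstract."""
    description: str                              # fine-grained textual description
    image_path: str                               # in-the-wild face image
    uv_texture_path: str                          # light-normalized texture UV map
    vertices: List[Tuple[float, float, float]]    # 3D shape as vertex positions (assumed layout)

    def missing_modalities(self) -> List[str]:
        """Return the names of modalities that are empty for this record."""
        missing = []
        if not self.description.strip():
            missing.append("description")
        if not self.image_path:
            missing.append("image_path")
        if not self.uv_texture_path:
            missing.append("uv_texture_path")
        if not self.vertices:
            missing.append("vertices")
        return missing


# Illustrative values only; none of these come from the paper's dataset.
sample = AvatarSample(
    description="a middle-aged man with a short grey beard",
    image_path="images/000001.jpg",
    uv_texture_path="uv/000001.png",
    vertices=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
)
```

A check such as `missing_modalities()` is one way to keep only fully paired records before training a model that conditions on text or image and predicts texture and shape.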
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis
