Dual Diffusion Models for Multi-modal Guided 3D Avatar Generation

Hong Li; Yutang Feng; Minqi Meng; Yichen Yang; Xuhui Liu; Baochang Zhang

arXiv:2603.04307·cs.CV·March 5, 2026

Open Access

TL;DR

This paper introduces PromptAvatar, a dual diffusion model framework that rapidly generates high-quality 3D avatars from text or image prompts, overcoming limitations of prior methods in semantic control and efficiency.

Contribution

The paper presents a large-scale multi-modal dataset and a novel dual diffusion model approach that directly maps prompts to 3D avatars, eliminating iterative optimization.

Findings

01. Outperforms state-of-the-art methods in quality and detail

02. Generates avatars in under 10 seconds

03. Achieves superior semantic alignment

Abstract

Generating high-fidelity 3D avatars from text or image prompts is highly sought after in virtual reality and human-computer interaction. However, existing text-driven methods often rely on iterative Score Distillation Sampling (SDS) or CLIP optimization, which struggle with fine-grained semantic control and suffer from excessively slow inference. Meanwhile, image-driven approaches are severely bottlenecked by the scarcity and high acquisition cost of high-quality 3D facial scans, limiting model generalization. To address these challenges, we first construct a novel, large-scale dataset comprising over 100,000 pairs across four modalities: fine-grained textual descriptions, in-the-wild face images, high-quality light-normalized texture UV maps, and 3D geometric shapes. Leveraging this comprehensive dataset, we propose PromptAvatar, a framework featuring dual diffusion models.…
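
As a reading aid, one record of the four-modality dataset described in the abstract could be modeled roughly as follows. This is only a sketch: the class name `AvatarSample`, its field names, the helper `missing_modalities`, and the example caption and file paths are all hypothetical placeholders, not the authors' actual schema or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AvatarSample:
    """One hypothetical dataset record; every field name is an assumption."""
    text: str             # fine-grained textual description
    image_path: str       # in-the-wild face image
    uv_texture_path: str  # light-normalized texture UV map
    shape_path: str       # 3D geometric shape


# The four modalities named in the abstract, keyed by the assumed field names.
MODALITIES = ("text", "image_path", "uv_texture_path", "shape_path")


def missing_modalities(sample: AvatarSample) -> List[str]:
    """Return the names of modalities whose value is empty."""
    return [name for name in MODALITIES if not getattr(sample, name)]


# Placeholder values for illustration only; not taken from the real dataset.
sample = AvatarSample(
    text="placeholder caption",
    image_path="images/000001.jpg",
    uv_texture_path="uv/000001.png",
    shape_path="shapes/000001.obj",
)
print(missing_modalities(sample))
```

A completeness check of this kind matters because a text-conditioned or image-conditioned model can only be trained on records where the conditioning modality and the target texture and shape are all present.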

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis