MANGO: Natural Multi-speaker 3D Talking Head Generation via 2D-Lifted Enhancement
Lei Zhu, Lijian Lin, Ye Zhu, Jiahao Wu, Xuehan Hou, Yu Li, Yunfei Liu, Jie Chen

TL;DR
MANGO is a novel two-stage framework for natural multi-speaker 3D talking head generation that leverages image-level supervision and a new dataset to improve realism and conversational fluidity.
Contribution
The paper introduces MANGO, a two-stage method combining diffusion-based modeling and photometric supervision, and presents MANGO-Dialog, a large dataset for multi-speaker 3D conversational heads.
Findings
Achieves high realism in multi-speaker 3D dialogue modeling
Outperforms existing methods in accuracy and naturalness
Provides a new dataset for multi-person conversational head generation
Abstract
Current audio-driven 3D head generation methods mainly focus on single-speaker scenarios and lack natural, bidirectional listen-and-speak interaction. Achieving seamless conversational behavior, where speaking and listening states transition fluidly, remains a key challenge. Existing 3D conversational avatar approaches rely on error-prone pseudo-3D labels that fail to capture fine-grained facial dynamics. To address these limitations, we introduce MANGO, a novel two-stage framework that leverages pure image-level supervision through alternating training to mitigate the noise introduced by pseudo-3D labels, thereby achieving better alignment with real-world conversational behaviors. Specifically, in the first stage, a diffusion-based transformer with a dual-audio interaction module models natural 3D motion from multi-speaker audio. In the second stage, we use a fast 3D Gaussian Renderer to…
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Social Robot Interaction and HRI
