MANGO: Natural Multi-speaker 3D Talking Head Generation via 2D-Lifted Enhancement

Lei Zhu; Lijian Lin; Ye Zhu; Jiahao Wu; Xuehan Hou; Yu Li; Yunfei Liu; Jie Chen

arXiv:2601.01749 · cs.CV · January 6, 2026

Open Access

TL;DR

MANGO is a novel two-stage framework for natural multi-speaker 3D talking head generation that leverages image-level supervision and a new dataset to improve realism and conversational fluidity.

Contribution

The paper introduces MANGO, a two-stage method combining diffusion-based modeling and photometric supervision, and presents MANGO-Dialog, a large dataset for multi-speaker 3D conversational heads.

Findings

01. Achieves high realism in multi-speaker 3D dialogue modeling
02. Outperforms existing methods in accuracy and naturalness
03. Provides a new dataset for multi-person conversational head generation

Abstract

Current audio-driven 3D head generation methods mainly focus on single-speaker scenarios and lack natural, bidirectional listen-and-speak interaction. Achieving seamless conversational behavior, where speaking and listening states transition fluidly, remains a key challenge. Existing 3D conversational avatar approaches rely on error-prone pseudo-3D labels that fail to capture fine-grained facial dynamics. To address these limitations, we introduce MANGO, a novel two-stage framework that leverages pure image-level supervision through alternating training to mitigate the noise introduced by pseudo-3D labels, thereby achieving better alignment with real-world conversational behaviors. Specifically, in the first stage, a diffusion-based transformer with a dual-audio interaction module models natural 3D motion from multi-speaker audio. In the second stage, we use a fast 3D Gaussian Renderer to…

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Social Robot Interaction and HRI