Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts

Xiang Deng; Youxin Pang; Xiaochen Zhao; Chao Xu; Lizhen Wang; Hongjiang Xiao; Shi Yan; Hongwen Zhang; Yebin Liu

arXiv:2410.23836 · cs.CV · March 2, 2026

TL;DR

Stereo-Talker is a one-shot system that synthesizes realistic 3D talking-human videos from audio in two stages: audio-to-motion mapping enriched with LLM priors, followed by a diffusion-based video generator with a prior-guided Mixture-of-Experts (MoE) mechanism. The output is high-quality and viewpoint-controllable.

Contribution

The paper introduces a two-stage framework: LLM priors enrich audio-to-motion generation, and a prior-guided MoE mechanism (view-guided and mask-guided experts) improves diffusion-based 3D human video synthesis.

Findings

1. Achieves precise lip synchronization and expressive gestures.

2. Produces temporally consistent, photo-realistic videos.

3. Enables continuous viewpoint control.

Abstract

This paper introduces Stereo-Talker, a novel one-shot audio-driven human video synthesis system that generates 3D talking videos with precise lip synchronization, expressive body gestures, temporally consistent photo-realistic quality, and continuous viewpoint control. The process follows a two-stage approach. In the first stage, the system maps audio input to high-fidelity motion sequences, encompassing upper-body gestures and facial expressions. To enrich motion diversity and authenticity, large language model (LLM) priors are integrated with text-aligned semantic audio features, leveraging LLMs' cross-modal generalization power to enhance motion quality. In the second stage, we improve diffusion-based video generation models by incorporating a prior-guided Mixture-of-Experts (MoE) mechanism: a view-guided MoE focuses on view-specific attributes, while a mask-guided MoE enhances…
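To make the idea of prior-guided expert routing concrete, here is a minimal sketch of one way a view prior could drive an MoE gate. It is an illustration under stated assumptions, not the paper's architecture: the abstract does not specify the gating function, so the softmax over angular distance, the `temperature` value, the anchor angles, and the names `view_gate` and `mixture` are all hypothetical, and the toy experts simply scale a feature vector.

```python
import math

def view_gate(azimuth_deg, anchors_deg, temperature=30.0):
    """Soft routing weights over experts, derived from a camera azimuth prior.

    Assumption: each expert is tied to a canonical view angle, and the gate
    is a softmax over negative angular distance to those anchors.
    """
    # Angular distance to each anchor, wrapped into [0, 180] degrees.
    dists = [abs((azimuth_deg - a + 180.0) % 360.0 - 180.0) for a in anchors_deg]
    logits = [-d / temperature for d in dists]
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture(features, experts, weights):
    """Weighted sum of expert outputs for a single feature vector."""
    outputs = [expert(features) for expert in experts]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(features))
    ]

# Three toy experts (hypothetical): each scales the features differently.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0)]
anchors = [-90.0, 0.0, 90.0]  # assumed canonical views: left, front, right

weights = view_gate(0.0, anchors)           # frontal view favors the middle expert
fused = mixture([1.0, -1.0], experts, weights)
```

The design point the sketch tries to convey is that the routing signal comes from an external prior (here, the viewpoint) rather than from a gate learned only from the features. A mask-guided variant could follow the same pattern, with per-region weights taken from a segmentation mask instead of an angle.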

Taxonomy

Topics: Human Motion and Animation · Music Technology and Sound Studies · Social Robot Interaction and HRI

Methods: Mixture of Experts