MusicFace: Music-driven Expressive Singing Face Synthesis
Pengfei Liu, Wenjin Deng, Hengda Li, Jintai Wang, Yinglin Zheng, Yiwei Ding, Xiaohu Guo, and Ming Zeng

TL;DR
This paper introduces a novel method for synthesizing realistic singing faces driven by music signals, effectively modeling facial motions and expressions by decoupling voice and background music streams, and demonstrating superior results over existing approaches.
Contribution
The paper proposes a decouple-and-fuse strategy for music-driven facial synthesis, along with a new dataset and detailed modeling of facial motion components, advancing the realism and expressiveness of singing face generation.
Findings
Outperforms state-of-the-art methods qualitatively and quantitatively.
Successfully models lip, facial expression, head pose, and eye states.
Introduces a new SingingFace Dataset for training and evaluation.
Abstract
Synthesizing a vivid and realistic singing face driven by a music signal remains an interesting and challenging problem. In this paper, we present a method for this task with natural motions of the lips, facial expression, head pose, and eye states. Because human voice and background music are mixed together in common music audio signals, we design a decouple-and-fuse strategy to tackle the challenge. We first decompose the input music audio into a human voice stream and a background music stream. Because the correlation between the two-stream input signals and the dynamics of the facial expressions, head motions, and eye states is implicit and complicated, we model their relationship with an attention scheme, where the effects of the two streams are fused seamlessly. Furthermore, to improve the expressiveness of the generated results, we propose to decompose head…
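The fuse half of the decouple-and-fuse strategy can be illustrated with a minimal sketch. It is not the authors' architecture: it assumes the voice and background-music streams have already been separated and encoded upstream into per-frame feature vectors, and it uses plain scaled dot-product attention over the two streams. The function name fuse_streams, the projection matrices w_q, w_k, w_v, and the choice to build the query from the sum of both streams are all hypothetical, introduced only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_streams(voice, music, w_q, w_k, w_v):
    """Fuse per-frame voice and music features with attention over the two streams.

    voice, music: arrays of shape (T, D), one feature vector per audio frame.
    w_q, w_k, w_v: projection matrices of shape (D, H) (hypothetical parameters).
    Returns the fused features (T, H) and the per-frame stream weights (T, 2).
    """
    streams = np.stack([voice, music], axis=1)   # (T, 2, D)
    query = (voice + music) @ w_q                # (T, H); assumption: query from both streams
    keys = streams @ w_k                         # (T, 2, H)
    values = streams @ w_v                       # (T, 2, H)

    # Scaled dot-product score of each frame's query against its two stream keys.
    scores = np.einsum("th,tsh->ts", query, keys) / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)           # (T, 2), each row sums to 1

    # Weighted sum of the two stream values for every frame.
    fused = np.einsum("ts,tsh->th", weights, values)
    return fused, weights


rng = np.random.default_rng(0)
T, D, H = 5, 8, 4
voice_feats = rng.normal(size=(T, D))   # stand-in for encoded voice stream
music_feats = rng.normal(size=(T, D))   # stand-in for encoded background-music stream
w_q, w_k, w_v = (rng.normal(size=(D, H)) for _ in range(3))

fused, weights = fuse_streams(voice_feats, music_feats, w_q, w_k, w_v)
```

In this sketch the weights array shows, frame by frame, how much the fused feature draws on the voice stream versus the background music, which is the kind of soft, learned blending an attention scheme provides. A downstream predictor for expression, head motion, or eye state would consume the fused features.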
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Hearing Loss and Rehabilitation
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
