Learning to Dub Movies via Hierarchical Prosody Models
Gaoxiang Cong, Liang Li, Yuankai Qi, Zhengjun Zha, Qi Wu, Wenyu Wang, Bin Jiang, Ming-Hsuan Yang, Qingming Huang

TL;DR
This paper introduces a hierarchical prosody modeling approach for movie dubbing that leverages visual cues from lip, face, and scene to generate emotionally accurate speech matching video content.
Contribution
A novel hierarchical prosody model that integrates visual information for emotion-aware speech synthesis in movie dubbing tasks.
Findings
Outperforms previous methods on Chem and V2C benchmarks.
Effectively captures emotion and speech dynamics from video cues.
Generates speech with improved emotional and prosodic accuracy.
Abstract
Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker's voice as reference. V2C is more challenging than conventional text-to-speech tasks because it additionally requires the generated speech to exactly match the varying emotions and speaking speed presented in the video. Unlike previous works, we propose a novel movie dubbing architecture to tackle these problems via hierarchical prosody modelling, which bridges visual information to the corresponding speech prosody from three aspects: lip, face, and scene. Specifically, we align lip movement to the speech duration, and convey facial expression to speech energy and pitch via an attention mechanism based on valence and arousal representations inspired by recent psychology…
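Illustrative sketch
The lip-to-duration alignment mentioned in the abstract can be pictured with a toy attention computation. The snippet below is only a minimal sketch under stated assumptions, not the authors' implementation: the function name soft_durations, the two-dimensional feature vectors, and the rule "each video frame spreads one unit of attention over the phonemes" are all hypothetical choices made for illustration. Under that rule the per-phoneme durations always sum to the number of video frames, which is one simple way to keep the speech length tied to the clip length.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def soft_durations(phoneme_emb, lip_feats):
    """Return one soft duration (in video frames) per phoneme.

    Hypothetical helper: phoneme_emb and lip_feats are lists of
    equal-dimension vectors; real systems would use learned features.
    """
    scale = math.sqrt(len(phoneme_emb[0]))
    durations = [0.0] * len(phoneme_emb)
    for frame in lip_feats:
        # Scaled dot-product scores of this frame against every phoneme.
        weights = softmax([dot(frame, p) / scale for p in phoneme_emb])
        # Each frame contributes exactly one unit of mass in total.
        for i, w in enumerate(weights):
            durations[i] += w
    return durations


# Toy data (assumed): two phonemes, three lip frames.
phonemes = [[1.0, 0.0], [0.0, 1.0]]
lips = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
durs = soft_durations(phonemes, lips)
print(durs)
```

In this toy input the first two frames resemble the first phoneme, so it receives the larger share of the three frames. A comparable attention pattern could, in principle, map facial-expression features to energy and pitch, but the abstract excerpt here does not give enough detail to sketch that part.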
Taxonomy
Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · ALIGN · Contrastive Language-Image Pre-training
