Learning to Dub Movies via Hierarchical Prosody Models

Gaoxiang Cong; Liang Li; Yuankai Qi; Zhengjun Zha; Qi Wu; Wenyu Wang; Bin Jiang; Ming-Hsuan Yang; Qingming Huang

arXiv:2212.04054·cs.CL·April 5, 2023

TL;DR

This paper introduces a hierarchical prosody modeling approach for movie dubbing that leverages visual cues from lip, face, and scene to generate emotionally accurate speech matching video content.

Contribution

A novel hierarchical prosody model that integrates visual information for emotion-aware speech synthesis in movie dubbing tasks.

Findings

01. Outperforms previous methods on the Chem and V2C benchmarks.

02. Effectively captures emotion and speech dynamics from video cues.

03. Generates speech with improved emotional and prosodic accuracy.

Abstract

Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker's voice as reference. V2C is more challenging than conventional text-to-speech tasks because it additionally requires the generated speech to exactly match the varying emotions and speaking speed presented in the video. Unlike previous works, we propose a novel movie dubbing architecture to tackle these problems via hierarchical prosody modelling, which links visual information to the corresponding speech prosody from three aspects: lip, face, and scene. Specifically, we align lip movement to the speech duration, and convey facial expression to speech energy and pitch via an attention mechanism based on valence and arousal representations inspired by recent psychology…
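The abstract's idea of conveying facial expression to pitch and energy "via attention" can be illustrated with a toy scaled dot-product cross-attention, in which each phoneme attends over per-frame affect features. This is only a minimal sketch under assumptions, not the authors' implementation: the function names, the two-dimensional (valence, arousal) features, and the toy numbers are all hypothetical, and the real model in the official repository uses learned PyTorch modules.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention.

    queries: one vector per phoneme (hypothetical text-side features)
    keys:    one vector per video frame (hypothetical face-side features)
    values:  one (valence, arousal) pair per video frame (assumed layout)
    Returns a per-phoneme affect context and the attention weights.
    """
    d = len(keys[0])
    contexts, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        ctx = [sum(w_t * v[j] for w_t, v in zip(w, values)) for j in range(len(values[0]))]
        contexts.append(ctx)
        weights.append(w)
    return contexts, weights

# Toy inputs: 3 phonemes, 4 video frames (all numbers are made up).
phonemes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frame_keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 1.0]]
frame_affect = [[0.8, 0.2], [0.1, 0.1], [-0.5, 0.9], [-0.4, 0.7]]  # (valence, arousal)

contexts, weights = cross_attention(phonemes, frame_keys, frame_affect)
for i, ctx in enumerate(contexts):
    print(f"phoneme {i}: valence={ctx[0]:.3f}, arousal={ctx[1]:.3f}")
```

In a full system, each per-phoneme context would feed a pitch or energy predictor; here it is simply a weighted average of the frame-level affect values, so it always stays within the range of the inputs.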

Code & Models

Repositories

galaxycong/hpmdubbing
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · ALIGN · Contrastive Language-Image Pre-training