Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis

Linsen Song; Wayne Wu; Chaoyou Fu; Chen Change Loy; Ran He

arXiv:2309.00030 · cs.CV · September 4, 2023

TL;DR

This paper presents a style-aware semi-parametric synthesis method for audio-driven dubbing of user-generated content. The method adapts quickly to a new speaker from limited video data and runs faster at test time than recent methods.

Contribution

It introduces a Style Translation Network built on a cross-modal AdaIN module, together with a semi-parametric renderer that follows a retrieve-warp-refine pipeline, for efficient, style-preserving dubbing of UGC.
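As a rough illustration of the AdaIN idea named above, the sketch below applies the standard adaptive instance normalization formula to a single feature channel: content features are normalized to zero mean and unit variance, then re-scaled and re-shifted with style statistics. This is not the paper's cross-modal module, whose details are not given here. The function name `adain` and the `scale` and `shift` inputs are assumptions, standing in for values that a network would predict from the target speaker's style.

```python
from statistics import fmean, pstdev


def adain(content, scale, shift, eps=1e-5):
    """Adaptive instance normalization over one feature channel.

    content: list of floats, e.g. features carrying the source content.
    scale, shift: floats standing in for style statistics of the target
    speaker (hypothetical inputs, not taken from the paper).
    """
    mu = fmean(content)
    sigma = pstdev(content, mu)
    # Normalize the content features to zero mean and unit variance.
    normalized = [(x - mu) / (sigma + eps) for x in content]
    # Re-scale and re-shift with the style statistics.
    return [scale * x + shift for x in normalized]


# Toy values chosen only for illustration.
content_features = [0.2, 0.4, 0.9, 1.5]
stylized = adain(content_features, scale=2.0, shift=-1.0)
```

After the call, `stylized` keeps the relative ordering of the content features but carries the mean and spread dictated by the style inputs, which is the sense in which AdaIN merges content from one source with style from another.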

Findings

01. Generates accurate speaking styles with limited data

02. Reduces training data and time significantly

03. Achieves faster testing speed than recent methods

Abstract

Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production, which requires massive training data and training time to learn a person-specific audio-video mapping. In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production. Designing a method for UGC poses two unique challenges: 1) the appearances of speakers are diverse and arbitrary, as the method needs to generalize across users; 2) the available video data for any one speaker is very limited. To tackle these challenges, we first introduce a new Style Translation Network that integrates the speaking style of the target and the speaking content of the source via a cross-modal AdaIN module, which enables our model to quickly adapt to a new speaker. We then develop a semi-parametric video renderer, which…
