VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over

Junchen Lu; Berrak Sisman; Rui Liu; Mingyang Zhang; Haizhou Li

arXiv:2110.03342 · eess.AS · March 3, 2022 · 1 citation

TL;DR

This paper introduces VisualTTS, a novel text-to-speech model that synthesizes speech synchronized with silent videos, enabling automatic voice-over with precise lip-speech alignment, advancing multimedia and dubbing applications.

Contribution

VisualTTS is the first TTS model conditioned on visual lip input, using textual-visual attention and a visual fusion strategy during acoustic decoding for accurate lip-speech synchronization.

Findings

01. Achieves superior lip-speech synchronization compared to baselines.

02. Outperforms existing systems in speech naturalness and alignment accuracy.

03. Demonstrates effectiveness on diverse video datasets.

Abstract

In this paper, we formulate a novel task to synthesize speech in sync with a silent pre-recorded video, denoted as automatic voice over (AVO). Unlike traditional speech synthesis, AVO seeks to generate not only human-sounding speech, but also perfect lip-speech synchronization. A natural solution to AVO is to condition the speech rendering on the temporal progression of the lip sequence in the video. We propose a novel text-to-speech model that is conditioned on visual input, named VisualTTS, for accurate lip-speech synchronization. The proposed VisualTTS adopts two novel mechanisms: 1) textual-visual attention, and 2) a visual fusion strategy during acoustic decoding, both of which contribute to forming an accurate alignment between the input text content and the lip motion in the input lip sequence. Experimental results show that VisualTTS achieves accurate lip-speech synchronization and…
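The textual-visual attention named in the abstract can be illustrated with a minimal sketch. It assumes plain scaled dot-product attention in which text-token embeddings act as queries and lip-frame embeddings act as keys and values; the abstract does not specify this configuration, so the roles, the function name `textual_visual_attention`, and the toy vectors are all illustrative assumptions rather than the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def textual_visual_attention(text_emb, lip_emb):
    """Scaled dot-product attention: each text token attends over lip frames.

    Assumption (not confirmed by the abstract): text tokens are the queries,
    lip frames are both keys and values.

    text_emb: list of text-token vectors, each of length d
    lip_emb:  list of lip-frame vectors, each of length d
    Returns (context, weights), where weights[i][j] is how strongly
    text token i attends to lip frame j.
    """
    d = len(text_emb[0])
    context, weights = [], []
    for q in text_emb:
        # Similarity between this text token and every lip frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in lip_emb]
        w = softmax(scores)
        # Weighted sum of lip-frame vectors for this token.
        ctx = [sum(w[j] * lip_emb[j][c] for j in range(len(lip_emb)))
               for c in range(d)]
        weights.append(w)
        context.append(ctx)
    return context, weights

# Toy input: 2 text tokens, 3 lip frames, embedding size 2 (values made up).
text_emb = [[1.0, 0.0], [0.0, 1.0]]
lip_emb = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
context, weights = textual_visual_attention(text_emb, lip_emb)
```

Each row of `weights` is a distribution over lip frames, so it can be read as a soft alignment between a text token and the video timeline. That alignment is the kind of signal that conditioning on the lip sequence is meant to provide.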

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Video Analysis and Summarization