Audiovisual Speech Synthesis using Tacotron2

Ahmed Hussen Abdelaziz; Anushree Prasanna Kumar; Chloe Seivwright; Gabriele Fanelli; Justin Binder; Yannis Stylianou; Sachin Kajarekar

arXiv:2008.00620 · eess.AS · August 31, 2021 · 1 citation

Open Access

TL;DR

This paper presents two audiovisual speech synthesis systems for 3D face models, one end-to-end and one modular, both of which generate synchronized speech and facial movements at near-human quality.

Contribution

It introduces and compares an end-to-end and a modular audiovisual speech synthesis system based on Tacotron2, incorporating emotion embeddings for expressive speech.

Findings

01

End-to-end system achieves a mean opinion score (MOS) of 4.1, matching ground truth.

02

Modular system achieves MOS of 3.9, offering greater flexibility.

03

Both systems produce near-human-like audiovisual speech.
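
The MOS figures above are averages of subjective ratings. A minimal sketch of how such a score is typically computed follows, assuming the conventional 1 to 5 rating scale; the paper's actual rating protocol is not described on this page. The ratings in the usage lines are made-up illustrative values, not data from the paper, and the 95% interval uses a normal approximation chosen here for simplicity.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_opinion_score(ratings: Sequence[float]) -> Tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Assumes ratings on a 1-5 scale; this scale is an assumption,
    not something stated in the source page.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in the range 1..5")

    mos = mean(ratings)
    # The sample standard deviation needs at least two ratings.
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = 0.0
    return mos, half_width


# Hypothetical ratings from eight raters for one system (illustrative only).
example_ratings = [5, 4, 4, 3, 5, 4, 4, 5]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval next to the mean helps a reader judge whether a gap such as 4.1 versus 3.9 is meaningful for a given number of raters.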

Abstract

Audiovisual speech synthesis is the problem of synthesizing a talking face while maximizing the coherency of the acoustic and visual speech. In this paper, we propose and compare two audiovisual speech synthesis systems for 3D face models. The first system is the AVTacotron2, which is an end-to-end text-to-audiovisual speech synthesizer based on the Tacotron2 architecture. AVTacotron2 converts a sequence of phonemes representing the sentence to synthesize into a sequence of acoustic features and the corresponding controllers of a face model. The output acoustic features are used to condition a WaveRNN to reconstruct the speech waveform, and the output facial controllers are used to generate the corresponding video of the talking face. The second audiovisual speech synthesis system is modular, where acoustic speech is synthesized from text using the traditional Tacotron2. The…
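
The data flow of the end-to-end system described in the abstract can be sketched as below: one decoder emits, for every output frame, both acoustic features and face-model controllers, so the two streams share a time axis by construction. This is a toy illustration, not the authors' implementation. The feature sizes (`N_ACOUSTIC`, `N_FACE`), the function names, the fixed number of frames per phoneme, and the random stand-in decoder are all assumptions made here; the page does not say whether the real model uses one joint projection or separate output heads.

```python
import random
from typing import List, Sequence, Tuple

N_ACOUSTIC = 80  # assumed number of acoustic features per frame
N_FACE = 51      # assumed number of face-model controllers per frame

Frame = List[float]


def toy_decoder(phoneme_ids: Sequence[int],
                frames_per_phoneme: int = 5,
                seed: int = 0) -> List[Frame]:
    """Hypothetical stand-in for the trained network.

    Emits random joint frames. A real Tacotron2-style decoder decides
    durations through attention rather than a fixed frames-per-phoneme rate.
    """
    rng = random.Random(seed)
    n_frames = len(phoneme_ids) * frames_per_phoneme
    width = N_ACOUSTIC + N_FACE
    return [[rng.gauss(0.0, 1.0) for _ in range(width)]
            for _ in range(n_frames)]


def split_joint_frames(joint: Sequence[Frame]) -> Tuple[List[Frame], List[Frame]]:
    """Split joint frames into an acoustic stream and a face-controller stream."""
    acoustic: List[Frame] = []
    face: List[Frame] = []
    for frame in joint:
        if len(frame) != N_ACOUSTIC + N_FACE:
            raise ValueError("unexpected frame width")
        acoustic.append(list(frame[:N_ACOUSTIC]))
        face.append(list(frame[N_ACOUSTIC:]))
    return acoustic, face


# Three hypothetical phoneme ids stand in for the input sentence.
joint = toy_decoder([3, 7, 1])
acoustic, face = split_joint_frames(joint)
# The acoustic stream would condition a vocoder (WaveRNN in the paper);
# the face stream would drive the 3D face model.
print(len(acoustic), len(acoustic[0]), len(face), len(face[0]))
```

Because both streams come from the same frame sequence, frame `t` of the audio features and frame `t` of the face controllers always refer to the same instant, which is the coherency property the abstract emphasizes.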

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis

Methods: Tanh Activation · Softmax · Sigmoid Activation · WaveRNN