Audiovisual Speech Synthesis using Tacotron2
Ahmed Hussen Abdelaziz, Anushree Prasanna Kumar, Chloe Seivwright, Gabriele Fanelli, Justin Binder, Yannis Stylianou, Sachin Kajarekar

TL;DR
This paper presents two audiovisual speech synthesis systems for 3D face models, one end-to-end and one modular. Both generate synchronized speech and facial movements at near-human quality.
Contribution
It introduces and compares an end-to-end and a modular audiovisual speech synthesis system based on Tacotron2, incorporating emotion embeddings for expressive speech.
Findings
The end-to-end system achieves a mean opinion score (MOS) of 4.1, matching the ground truth.
The modular system achieves a MOS of 3.9 and offers greater flexibility.
Both systems produce near-human-like audiovisual speech.
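As a reminder of what the MOS figures above measure: a mean opinion score is the arithmetic mean of listener ratings, conventionally given on a 1 to 5 scale. The sketch below is only an illustration of that arithmetic. The function name, the 95% confidence half-width (normal approximation), and the ratings are all hypothetical and are not taken from the paper's evaluation.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for a list of 1-5 ratings.

    The confidence interval uses a normal approximation; it is an
    illustrative addition, not something the paper summary reports.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = mean(ratings)
    # With a single rating there is no spread to estimate.
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = 0.0
    return mos, half_width


# Hypothetical ratings from eight raters for one system.
example_ratings = [4, 5, 4, 3, 5, 4, 4, 4]
mos, half_width = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.3f} +/- {half_width:.3f}")
```

With only a handful of raters the interval is wide, which is why a 0.2 gap between two systems is best read alongside the number of ratings behind each score.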
Abstract
Audiovisual speech synthesis is the problem of synthesizing a talking face while maximizing the coherency of the acoustic and visual speech. In this paper, we propose and compare two audiovisual speech synthesis systems for 3D face models. The first system is the AVTacotron2, which is an end-to-end text-to-audiovisual speech synthesizer based on the Tacotron2 architecture. AVTacotron2 converts a sequence of phonemes representing the sentence to synthesize into a sequence of acoustic features and the corresponding controllers of a face model. The output acoustic features are used to condition a WaveRNN to reconstruct the speech waveform, and the output facial controllers are used to generate the corresponding video of the talking face. The second audiovisual speech synthesis system is modular, where acoustic speech is synthesized from text using the traditional Tacotron2. The…
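The data flow described for AVTacotron2 (one decoder producing both acoustic features and face-model controllers, which then go to a vocoder and a renderer respectively) can be sketched as follows. This is a toy illustration under stated assumptions: it assumes each output frame is a single vector with the acoustic features first and the controllers after them, which the abstract does not confirm, and the function name and the tiny dimensions are hypothetical.

```python
def split_av_frames(frames, n_acoustic):
    """Split joint output frames into an acoustic and a visual stream.

    Assumption (not confirmed by the abstract): each frame stores the
    n_acoustic acoustic features first, followed by the face controllers.
    """
    acoustic, visual = [], []
    for frame in frames:
        if len(frame) <= n_acoustic:
            raise ValueError("frame has no face-controller values")
        acoustic.append(list(frame[:n_acoustic]))
        visual.append(list(frame[n_acoustic:]))
    return acoustic, visual


# Toy frames: 3 acoustic features + 2 face controllers per time step.
# Real systems use far larger vectors; these sizes are for illustration only.
toy_frames = [
    [0.1, 0.2, 0.3, 0.9, 0.0],
    [0.2, 0.1, 0.4, 0.8, 0.1],
]
acoustic_stream, visual_stream = split_av_frames(toy_frames, n_acoustic=3)

# acoustic_stream would condition the vocoder (WaveRNN in the paper);
# visual_stream would drive the 3D face model, frame by frame.
print(len(acoustic_stream), len(visual_stream))
```

Because both streams come from the same time steps, they stay aligned frame by frame, which is the property the end-to-end design relies on for audiovisual coherence.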
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
Methods: Tanh Activation · Softmax · Sigmoid Activation · WaveRNN
