AudioVisual Speech Synthesis: A brief literature review

Efthymios Georgiou; Athanasios Katsamanis

arXiv:2103.03927·cs.SD·March 9, 2021

Open Access

TL;DR

This literature review explores audiovisual speech synthesis by analyzing the separate components of text-to-speech conversion and talking head animation, highlighting various models and their advantages and disadvantages.

Contribution

It provides a comprehensive categorization and discussion of existing methods in audiovisual speech synthesis, emphasizing the importance of facial models and intermediate representations.

Findings

01. Different TTS models map text to acoustic features

02. Voice-driven animation approaches vary by facial model type

03. Review highlights strengths and weaknesses of various methods

Abstract

This brief literature review studies the problem of audiovisual speech synthesis, that is, the problem of generating an animated talking head from input text. Because of the high complexity of this problem, we approach it as the composition of two problems: Text-to-Speech (TTS) synthesis and voice-driven talking-head animation. For TTS, we present models that map text to intermediate acoustic representations, e.g., mel-spectrograms, as well as models that generate voice signals conditioned on these intermediate representations, i.e., vocoders. For the talking-head animation problem, we categorize approaches based on whether they produce human faces or anthropomorphic figures. An attempt is also made to discuss the importance of the choice of facial models in the second case. Throughout the review, we briefly describe the most important work in…
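The composition described in the abstract (text to an intermediate acoustic representation, then to a voice signal, then to a talking-head animation) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen for illustration only: the class `AudioVisualSynthesizer`, the three placeholder stage functions, and the constants (80 mel bins, a hop of 256 samples, 16 kHz audio, 25 video frames per second) are assumptions, not details taken from the review. The placeholder stages emit zeros and only demonstrate how the stages connect.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Illustrative type aliases (assumptions, not from the review).
MelSpectrogram = List[List[float]]  # acoustic frames x mel bins
Waveform = List[float]              # audio samples
FaceFrames = List[List[float]]      # video frames x facial-model parameters

# Illustrative constants (assumptions).
N_MELS = 80                         # mel bins per acoustic frame
HOP = 256                           # audio samples per acoustic frame
SAMPLES_PER_VIDEO_FRAME = 640       # 16000 Hz audio / 25 fps video


@dataclass
class AudioVisualSynthesizer:
    """Chains the stages named in the abstract."""
    acoustic_model: Callable[[str], MelSpectrogram]  # text -> mel-spectrogram
    vocoder: Callable[[MelSpectrogram], Waveform]    # mel-spectrogram -> voice
    animator: Callable[[Waveform], FaceFrames]       # voice -> face animation

    def synthesize(self, text: str) -> Tuple[Waveform, FaceFrames]:
        mel = self.acoustic_model(text)
        audio = self.vocoder(mel)
        frames = self.animator(audio)
        return audio, frames


def placeholder_acoustic_model(text: str) -> MelSpectrogram:
    # Stand-in: one silent acoustic frame per input character.
    return [[0.0] * N_MELS for _ in text]


def placeholder_vocoder(mel: MelSpectrogram) -> Waveform:
    # Stand-in: HOP silent samples per acoustic frame.
    return [0.0] * (len(mel) * HOP)


def placeholder_animator(audio: Waveform) -> FaceFrames:
    # Stand-in: one neutral pose (3 parameters) per whole video frame of audio.
    n_frames = len(audio) // SAMPLES_PER_VIDEO_FRAME
    return [[0.0] * 3 for _ in range(n_frames)]


pipeline = AudioVisualSynthesizer(
    acoustic_model=placeholder_acoustic_model,
    vocoder=placeholder_vocoder,
    animator=placeholder_animator,
)
audio, frames = pipeline.synthesize("hello")
print(len(audio), len(frames))
```

Keeping each stage behind a plain callable reflects the decomposition in the abstract: an acoustic model, a vocoder, or an animation method could be swapped independently, as long as the intermediate representation passed between stages keeps the same shape.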

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing