The Theory behind Controllable Expressive Speech Synthesis: a Cross-disciplinary Approach

Noé Tits; Kevin El Haddad; Thierry Dutoit

arXiv:1910.06234·eess.AS·October 15, 2019

TL;DR

This paper gives an overview of controllable expressive speech synthesis, covering its main technical paradigms, the history of Text-to-Speech methods, and recent deep learning approaches, while drawing on cross-disciplinary insights.

Contribution

It offers a cross-disciplinary theoretical framework for expressive speech synthesis, highlighting recent deep learning techniques and their integration with traditional paradigms.

Findings

01. Overview of speech representation and encoding methods

02. Historical review of TTS synthesis paradigms

03. Discussion of deep learning models like seq2seq, CNNs, RNNs, and attention mechanisms

Abstract

As part of the Human-Computer Interaction field, expressive speech synthesis is a very rich domain, as it requires knowledge in areas such as machine learning, signal processing, sociology, and psychology. In this chapter, we focus mostly on the technical side. From the recording of expressive speech to its modeling, the reader will get an overview of the main paradigms used in this field, through some of the most prominent systems and methods. We explain how speech can be represented and encoded with audio features. We present a history of the main methods of Text-to-Speech synthesis: concatenative, parametric, and statistical parametric speech synthesis. Finally, we focus on the last of these, with the latest techniques modeling Text-to-Speech synthesis as a sequence-to-sequence problem. This enables the use of Deep Learning blocks such as Convolutional and Recurrent Neural Networks as well…

Taxonomy

Topics: Speech Recognition and Synthesis