Expressivity and Speech Synthesis
Andreas Triantafyllopoulos, Björn W. Schuller

TL;DR
This chapter reviews recent advances in speech synthesis, emphasizing progress toward expressive, high-fidelity speech generation, and discusses societal implications and ethical considerations.
Contribution
It summarizes methodological developments in expressive speech synthesis and explores future directions for creating more complex, emotionally rich speech outputs.
Findings
Reported that the field appears to be on the cusp of high-fidelity, expressive speech synthesis for single, isolated utterances
Identified key methodological advances enabling expressivity in speech synthesis
Discussed societal risks and ethical considerations of expressive speech synthesis (ESS) technology
Abstract
Imbuing machines with the ability to talk has been a longtime pursuit of artificial intelligence (AI) research. From the very beginning, the community has not only aimed to synthesise high-fidelity speech that accurately conveys the semantic meaning of an utterance, but also to colour it with inflections that cover the same range of affective expressions that humans are capable of. After many years of research, it appears that we are on the cusp of achieving this when it comes to single, isolated utterances. This unveils an abundance of potential avenues to explore when it comes to combining these single utterances with the aim of synthesising more complex, longer-term behaviours. In the present chapter, we outline the methodological advances that brought us so far and sketch out the ongoing efforts to reach that coveted next level of artificial expressivity. We also discuss the…
Taxonomy
Topics: Digital Communication and Language
