Analysing Shortcomings of Statistical Parametric Speech Synthesis
Gustav Eje Henter, Simon King, Thomas Merritt, Gilles Degottex

TL;DR
This book chapter critically analyses shortcomings of statistical parametric speech synthesis (SPSS), with particular attention to vocoding, and presents a general methodology for empirically quantifying how individual shortcomings affect the perceived quality of synthesised speech.
Contribution
It provides a systematic approach for measuring how specific assumptions and design choices affect SPSS quality, addressing the lack of prior work on identifying the perceptually most important problems in speech synthesis.
Findings
Vocoding significantly impacts speech naturalness and intelligibility.
The methodology enables quantification of perceptual effects of various SPSS assumptions.
Comparison of different factors reveals their relative importance in speech quality.
Abstract
Output from statistical parametric speech synthesis (SPSS) remains noticeably worse than natural speech recordings in terms of quality, naturalness, speaker similarity, and intelligibility in noise. There are many hypotheses regarding the origins of these shortcomings, but these hypotheses are often kept vague and presented without empirical evidence that could confirm and quantify how a specific shortcoming contributes to imperfections in the synthesised speech. Throughout speech synthesis literature, surprisingly little work is dedicated towards identifying the perceptually most important problems in speech synthesis, even though such knowledge would be of great value for creating better SPSS systems. In this book chapter, we analyse some of the shortcomings of SPSS. In particular, we discuss issues with vocoding and present a general methodology for quantifying the effect of any of…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
