Deep Learning Based Assessment of Synthetic Speech Naturalness
Gabriel Mittag, Sebastian Möller

TL;DR
This paper introduces a new deep learning model for objectively assessing the naturalness of synthetic speech, applicable across languages and trained on diverse datasets.
Contribution
It presents a novel CNN-LSTM based model for speech naturalness prediction, enhanced by transfer learning from speech quality models, and makes the tool publicly available.
Findings
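Model trained and tested on 16 datasets, including data from the Blizzard Challenge and the Voice Conversion Challenge.
Transfer learning from speech quality prediction models trained on objective POLQA scores improves prediction reliability.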
Model is language-independent and end-to-end.
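The transfer learning finding above can be illustrated with a minimal warm-start sketch. It is not the authors' implementation: the layer names, parameter sizes, and the choice to leave the output head freshly initialised are all assumptions made for illustration, and the "pretrained" quality model here is only a randomly initialised stand-in.

```python
import random


def init_params(shapes, seed):
    """Create randomly initialised parameters: name -> flat list of floats."""
    rng = random.Random(seed)
    return {name: [rng.uniform(-0.1, 0.1) for _ in range(size)]
            for name, size in shapes.items()}


def warm_start(target, pretrained, skip_prefixes=("head.",)):
    """Copy pretrained values into target where name and size match.

    Parameters whose names start with a skip prefix keep their fresh
    initialisation (an assumption; the source does not say which layers
    are transferred). Returns the names that were copied.
    """
    copied = []
    for name, values in pretrained.items():
        if name.startswith(skip_prefixes):
            continue
        if name in target and len(target[name]) == len(values):
            target[name] = list(values)
            copied.append(name)
    return copied


# Hypothetical layer names and sizes for a CNN-LSTM style network.
SHAPES = {"cnn.conv1": 8, "cnn.conv2": 16, "lstm.weights": 32, "head.linear": 4}

# Stand-in for a speech quality model pretrained on POLQA scores.
quality_model = init_params(SHAPES, seed=0)
# Naturalness model before fine-tuning on naturalness ratings.
naturalness_model = init_params(SHAPES, seed=1)

copied = warm_start(naturalness_model, quality_model)
print(copied)
```

After the warm start, the naturalness model would be fine-tuned on naturalness ratings. Only the initialisation step is shown here.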
Abstract
In this paper, we present a new objective prediction model for synthetic speech naturalness. It can be used to evaluate Text-To-Speech or Voice Conversion systems and works independently of the language. The model is trained end-to-end and is based on a CNN-LSTM network that was previously shown to give good results for speech quality estimation. We trained and tested the model on 16 different datasets, including data from the Blizzard Challenge and the Voice Conversion Challenge. Further, we show that the reliability of deep learning-based naturalness prediction can be improved by transfer learning from speech quality prediction models that are trained on objective POLQA scores. The proposed model is made publicly available and can, for example, be used to evaluate different TTS system configurations.
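As a rough illustration of comparing TTS system configurations with such a model, the sketch below averages per-utterance predicted naturalness scores for each configuration and ranks the configurations. The interface of the released tool is not shown in this summary, so the scores are placeholder values on an assumed MOS-like scale, and the configuration names and the `rank_systems` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def rank_systems(utterance_scores):
    """Average per-utterance scores per system and sort best first.

    utterance_scores: iterable of (system_name, predicted_score) pairs.
    Returns a list of (system_name, mean_score), highest mean first.
    """
    per_system = defaultdict(list)
    for system, score in utterance_scores:
        per_system[system].append(score)
    means = {system: mean(values) for system, values in per_system.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Placeholder predictions, not real model output.
scores = [
    ("config_a", 3.0), ("config_a", 3.5),
    ("config_b", 4.0), ("config_b", 4.5),
    ("config_c", 2.0),
]

ranking = rank_systems(scores)
for system, avg in ranking:
    print(f"{system}: {avg:.2f}")
```

Averaging over many utterances per configuration is one common way to reduce the effect of per-utterance prediction noise. Whether the authors evaluate at utterance or system level is not stated in this summary.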
