On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis
Siyang Wang, Gustav Eje Henter, Joakim Gustafson, Éva Székely

TL;DR
This paper systematically compares six self-supervised learning (SSL) speech representations, at three layers each, as intermediate representations for two-stage spontaneous speech synthesis, and evaluates SSL-based MOS prediction on synthesized spontaneous speech.
Contribution
It extends the comparison of SSL models for spontaneous TTS to 6 SSL models and 3 layers within each, and evaluates an SSL-based MOS prediction framework, previously developed for read speech synthesis, on spontaneous speech.
Findings
Certain SSL models and layers outperform others in spontaneous TTS.
SSL-based MOS prediction correlates well with human judgments in spontaneous speech.
Results are consistent across different spontaneous speech corpora.
Abstract
Self-supervised learning (SSL) speech representations learned from large amounts of diverse, mixed-quality speech data without transcriptions are gaining ground in many speech technology applications. Prior work has shown that SSL is an effective intermediate representation in two-stage text-to-speech (TTS) for both read and spontaneous speech. However, it is still not clear which SSL and which layer from each SSL model is most suited for spontaneous TTS. We address this shortcoming by extending the scope of comparison for SSL in spontaneous TTS to 6 different SSLs and 3 layers within each SSL. Furthermore, SSL has also shown potential in predicting the mean opinion scores (MOS) of synthesized speech, but this has only been done in read-speech MOS prediction. We extend an SSL-based MOS prediction framework previously developed for scoring read speech synthesis and evaluate its…
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
