On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis

Siyang Wang; Gustav Eje Henter; Joakim Gustafson; Éva Székely

arXiv:2307.05132·eess.AS·July 12, 2023

Open Access

TL;DR

This paper systematically compares six self-supervised speech representations, at three layers each, as intermediate representations for spontaneous speech synthesis, and evaluates SSL-based MOS prediction on spontaneous speech.

Contribution

It extends the comparison of SSL models for spontaneous TTS to six models and three layers within each, and evaluates an SSL-based MOS prediction framework, previously developed for read speech synthesis, on spontaneous speech.

Findings

01

Certain SSL models and layers outperform others in spontaneous TTS.

02

SSL-based MOS prediction correlates well with human judgments in spontaneous speech.

03

Results are consistent across different spontaneous speech corpora.

Abstract

Self-supervised learning (SSL) speech representations, learned from large amounts of diverse, mixed-quality speech data without transcriptions, are gaining ground in many speech technology applications. Prior work has shown that SSL is an effective intermediate representation in two-stage text-to-speech (TTS) for both read and spontaneous speech. However, it is still not clear which SSL model, and which layer within each model, is best suited for spontaneous TTS. We address this shortcoming by extending the scope of comparison for SSL in spontaneous TTS to 6 different SSLs and 3 layers within each SSL. Furthermore, SSL has also shown potential in predicting the mean opinion scores (MOS) of synthesized speech, but this has only been done in read-speech MOS prediction. We extend an SSL-based MOS prediction framework previously developed for scoring read speech synthesis and evaluate its…

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques