Enhancing TTS Stability in Hebrew using Discrete Semantic Units
Ella Zeldes, Or Tal, Yossi Adi

TL;DR
This paper presents a novel TTS approach using discrete semantic units derived from HuBERT codes, significantly improving stability and robustness in Hebrew speech synthesis while maintaining naturalness and speaker similarity.
Contribution
Introduces LOTHM, a TTS method leveraging self-supervised semantic units to enhance stability and reduce diacritic dependency, applicable across languages.
Findings
Achieves higher stability in Hebrew TTS
Maintains naturalness and speaker similarity
Demonstrates adaptability to English
Abstract
This study introduces a refined approach to Text-to-Speech (TTS) generation that significantly enhances sampling stability across languages, with a particular focus on Hebrew. By leveraging discrete semantic units with higher phonetic correlation obtained from a self-supervised model, our method addresses the inherent instability often encountered in TTS systems, especially those dealing with non-diacriticized scripts like Hebrew. Utilizing HuBERT codes, our model generates discrete representations that are optimized for TTS tasks, thereby reducing the dependency on diacritic-based text processing. This advancement not only simplifies the language modeling process but also improves the robustness of the speech output and offers controllability over it, owing to the disentanglement properties of the semantic units. The inclusion of a speaker embedding in the vocoder further aids in capturing the unique…
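The abstract's "discrete semantic units" can be illustrated with a minimal sketch. It assumes the common recipe in which continuous frame-level features (for example HuBERT hidden states) are mapped to the index of the nearest cluster centroid; the abstract does not state the quantizer used, so the centroids, the toy 2-D features, and the function names below are all hypothetical. Collapsing consecutive repeats is a frequently used optional step in unit-based pipelines and is likewise an assumption here, not something the abstract confirms.

```python
from itertools import groupby
from typing import List, Sequence

Vector = Sequence[float]


def nearest_centroid(frame: Vector, centroids: Sequence[Vector]) -> int:
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    def sq_dist(centroid: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(frame, centroid))

    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def quantize(frames: Sequence[Vector], centroids: Sequence[Vector]) -> List[int]:
    """Map each continuous frame to a discrete unit id (assumed nearest-centroid quantizer)."""
    return [nearest_centroid(frame, centroids) for frame in frames]


def collapse_repeats(units: Sequence[int]) -> List[int]:
    """Merge runs of identical consecutive units (optional step, assumed here)."""
    return [unit for unit, _ in groupby(units)]


# Toy 2-D stand-ins for real frame features; real features are high-dimensional.
toy_centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
toy_frames = [[0.1, -0.1], [0.2, 0.0], [0.9, 1.2], [4.8, 5.1], [5.0, 5.0]]

toy_units = quantize(toy_frames, toy_centroids)
toy_collapsed = collapse_repeats(toy_units)
```

In a pipeline of this kind, a language model would predict the unit sequence from text and a vocoder would turn units back into audio; those stages are outside this sketch.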
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques
