Improving TTS for Shanghainese: Addressing Tone Sandhi via Word Segmentation
Yuanhao Chen

TL;DR
This paper improves Shanghainese TTS by incorporating word segmentation and syllable annotation to better model tone sandhi, addressing limitations of previous Mandarin-based approaches and enhancing speech naturalness.
Contribution
It introduces a novel preprocessing method using word segmentation and syllable annotation to better capture tone sandhi in Shanghainese TTS models.
Findings
Word segmentation improves tone sandhi accuracy in TTS.
Syllable annotation serves as a proxy for prosodic information.
Prosodic annotation can model dynamic tonal phenomena.
Abstract
Tone is a crucial component of the prosody of Shanghainese, a Wu Chinese variety spoken primarily in urban Shanghai. Tone sandhi applies to all multisyllabic words in Shanghainese and is therefore key to natural-sounding speech. Unfortunately, recent work on Shanghainese TTS (text-to-speech), such as Apple's VoiceOver, has shown poor performance with tone sandhi, especially left-dominant sandhi (LD). Here I show that word segmentation during text preprocessing can improve the quality of tone sandhi production in TTS models. Syllables within the same word are annotated with a special symbol, which serves as a proxy for prosodic information about the domain of LD. Contrary to the common practice of using prosodic annotation mainly for static pauses, this paper demonstrates that prosodic annotation can also be applied to dynamic tonal phenomena. I anticipate this project to be a starting point…
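The annotation step described in the abstract can be illustrated with a minimal sketch. It assumes the text has already been segmented into words, each given as a list of syllables, and it uses a hyphen as the within-word marker. The marker choice, the function names, and the example segmentation are all hypothetical; the abstract does not specify the actual symbol or segmenter used.

```python
from typing import List

# Hypothetical marker joining syllables that belong to the same word.
# The abstract only says "a special symbol"; "-" is an assumption.
LD_MARK = "-"


def annotate_words(words: List[List[str]], mark: str = LD_MARK) -> str:
    """Join the syllables of each word with `mark`; separate words with spaces.

    Each run of marked syllables then stands for one candidate domain of
    left-dominant sandhi that a TTS front end could condition on.
    """
    return " ".join(mark.join(syllables) for syllables in words)


def sandhi_domains(annotated: str, mark: str = LD_MARK) -> List[List[str]]:
    """Recover the per-word syllable groups from an annotated string."""
    return [word.split(mark) for word in annotated.split()]


# Illustrative segmentation only (one character per syllable).
segmented = [["上", "海"], ["闲", "话"]]
annotated = annotate_words(segmented)
print(annotated)
print(sandhi_domains(annotated))
```

Keeping the marker inside the input string, rather than in a separate feature stream, means an existing character- or syllable-level TTS pipeline can consume it without architectural changes; whether the paper does this is not stated in the abstract.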
Taxonomy
Topics: Phonetics and Phonology Research · Speech Recognition and Synthesis · Natural Language Processing Techniques
