Probing the phonetic and phonological knowledge of tones in Mandarin TTS models
Jian Zhu

TL;DR
Using controlled linguistic stimuli, this paper investigates how well Mandarin TTS models reproduce tonal phonetic and phonological phenomena. The models capture surface tonal coarticulation patterns well but apply the Tone-3 sandhi rule inconsistently in novel sentences.
Contribution
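The paper shows that Tacotron 2 captures surface tonal coarticulation but does not consistently apply Tone-3 sandhi to novel sentences. It also shows that adding pre-trained BERT embeddings improves naturalness and prosody and helps the model generalize the sandhi rule to novel complex sentences, although overall sandhi accuracy remains low.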
Findings
TTS models capture surface tonal coarticulation well.
Models struggle with Tone-3 sandhi in novel sentences.
BERT embeddings improve naturalness and tone rule generalization.
Abstract
This study probes the phonetic and phonological knowledge of lexical tones in TTS models through two experiments. Controlled stimuli for testing tonal coarticulation and tone sandhi in Mandarin were fed into Tacotron 2 and WaveGlow to generate speech samples, which were subject to acoustic analysis and human evaluation. Results show that both baseline Tacotron 2 and Tacotron 2 with BERT embeddings capture the surface tonal coarticulation patterns well but fail to consistently apply the Tone-3 sandhi rule to novel sentences. Incorporating pre-trained BERT embeddings into Tacotron 2 improves the naturalness and prosody performance, and yields better generalization of Tone-3 sandhi rules to novel complex sentences, although the overall accuracy for Tone-3 sandhi was still low. Given that TTS models do capture some linguistic phenomena, it is argued that they can be used to generate and…
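For readers unfamiliar with the rule being tested, the sketch below illustrates the textbook form of Tone-3 sandhi: an underlying Tone 3 surfaces as Tone 2 when the next syllable is also Tone 3 (for example, ni3 hao3 is pronounced ni2 hao3). This is only an illustration and not the paper's evaluation code. The function name `apply_tone3_sandhi` is hypothetical, and the sketch assumes the rule applies to every adjacent pair of underlying Tone 3 syllables. In real speech, longer Tone-3 sequences depend on prosodic and syntactic grouping, which is presumably why novel complex sentences are hard for the models.

```python
def apply_tone3_sandhi(tones):
    """Return surface tones for a list of underlying Mandarin tone numbers.

    Simplifying assumption: every Tone 3 that is immediately followed by an
    underlying Tone 3 surfaces as Tone 2. Prosodic grouping is ignored.
    """
    surface = list(tones)
    for i in range(len(tones) - 1):
        # Check the underlying tones, not the already-modified surface tones.
        if tones[i] == 3 and tones[i + 1] == 3:
            surface[i] = 2
    return surface


# ni3 hao3 ("hello") is pronounced ni2 hao3.
print(apply_tone3_sandhi([3, 3]))
```

A probe in the spirit of the study would compare a rule-based prediction like this against the tone actually realized in the synthesized audio.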
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Natural Language Processing Techniques
Methods: Dilated Causal Convolution · Zoneout · Long Short-Term Memory · WaveNet · Mixture of Logistic Distributions · Location Sensitive Attention · Bidirectional LSTM · Linear Layer · Tacotron2 · Affine Coupling
