Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT
Kazuki Yamauchi, Yuki Saito, Hiroshi Saruwatari

TL;DR
This paper introduces a novel cross-dialect TTS model that leverages multi-dialect phoneme-level BERT and phoneme-level accent latent variables to synthesize natural speech across dialects, especially in pitch-accent languages.
Contribution
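The paper proposes a TTS framework with three sub-modules: a backbone TTS model, a reference encoder that extracts phoneme-level accent latent variables (ALVs) from speech, and an ALV predictor built on multi-dialect phoneme-level BERT, with the aim of improving cross-dialect speech synthesis quality.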
Findings
Enhanced dialectal naturalness in synthetic speech.
Outperforms a baseline derived from conventional dialect TTS methods.
Effective prediction of dialect-specific accent features.
Abstract
We explore cross-dialect text-to-speech (CD-TTS), a task to synthesize learned speakers' voices in non-native dialects, especially in pitch-accent languages. CD-TTS is important for developing voice agents that naturally communicate with people across regions. We present a novel TTS model comprising three sub-modules to perform competitively at this task. We first train a backbone TTS model to synthesize dialect speech from a text conditioned on phoneme-level accent latent variables (ALVs) extracted from speech by a reference encoder. Then, we train an ALV predictor to predict ALVs tailored to a target dialect from input text leveraging our novel multi-dialect phoneme-level BERT. We conduct multi-dialect TTS experiments and evaluate the effectiveness of our model by comparing it with a baseline derived from conventional dialect TTS methods. The results show that our model improves the…
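The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every function and class name below is hypothetical, the "reference encoder" is a simple per-phoneme average of a frame-level feature, the "ALV predictor" is a lookup table standing in for the BERT-based predictor, and the "backbone" returns labelled tuples instead of audio. The labels `dialect_B` and `speaker_A` are placeholders. The sketch only shows which module consumes which input in the two training stages and at inference time.

```python
from typing import Dict, List, Tuple


def reference_encoder(frames: List[float], durations: List[int]) -> List[float]:
    """Toy reference encoder (assumption): one ALV per phoneme, computed as the
    mean of a frame-level feature over that phoneme's frames."""
    alvs, start = [], 0
    for d in durations:
        segment = frames[start:start + d]
        alvs.append(sum(segment) / len(segment))
        start += d
    return alvs


class ToyALVPredictor:
    """Toy stand-in for the ALV predictor (assumption): memorizes the mean ALV
    observed for each (dialect, phoneme) pair. The paper uses a multi-dialect
    phoneme-level BERT here instead."""

    def __init__(self) -> None:
        self.stats: Dict[Tuple[str, str], List[float]] = {}

    def fit(self, dialect: str, phonemes: List[str], alvs: List[float]) -> None:
        for p, a in zip(phonemes, alvs):
            entry = self.stats.setdefault((dialect, p), [0.0, 0])
            entry[0] += a
            entry[1] += 1

    def predict(self, dialect: str, phonemes: List[str]) -> List[float]:
        out = []
        for p in phonemes:
            total, count = self.stats.get((dialect, p), (0.0, 0))
            out.append(total / count if count else 0.0)  # 0.0 for unseen pairs
        return out


def backbone_tts(phonemes: List[str], alvs: List[float], speaker: str):
    """Toy backbone (assumption): pairs each phoneme with its ALV and the
    speaker label, in place of synthesizing a waveform."""
    assert len(phonemes) == len(alvs), "need exactly one ALV per phoneme"
    return [(speaker, p, round(a, 3)) for p, a in zip(phonemes, alvs)]


# Stage 1: ALVs come from recorded speech via the reference encoder.
phonemes = ["a", "m", "e"]
frames = [1.0, 1.0, 2.0, 2.0, 2.0, 4.0]   # made-up frame-level feature
durations = [2, 3, 1]                      # frames per phoneme
alvs = reference_encoder(frames, durations)

# Stage 2: the predictor learns to produce those ALVs from text and dialect.
predictor = ToyALVPredictor()
predictor.fit("dialect_B", phonemes, alvs)

# Inference: no reference speech; ALVs are predicted for the target dialect.
predicted = predictor.predict("dialect_B", phonemes)
output = backbone_tts(phonemes, predicted, speaker="speaker_A")
```

The point of the separation is that the backbone only ever sees ALVs, so at inference time a speaker whose training data is in one dialect can be paired with ALVs predicted for another.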
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research
Methods: Softmax · Layer Normalization · Dropout · Attention Is All You Need · WordPiece · Residual Connection · Attention Dropout · Linear Layer · Multi-Head Attention
