XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech
Linh The Nguyen, Thinh Pham, Dat Quoc Nguyen

TL;DR
XPhoneBERT is a multilingual model pre-trained on phoneme-level text from nearly 100 languages and locales. Used as the input phoneme encoder of a neural text-to-speech (TTS) model, it improves naturalness and prosody and helps produce fairly high-quality speech when training data is limited.
Contribution
It introduces the first multilingual model pre-trained to learn phoneme representations for TTS, trained on 330M phoneme-level sentences, and releases it publicly to support multilingual speech synthesis research.
Findings
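Using XPhoneBERT as the input phoneme encoder significantly improves the naturalness and prosody of a strong neural TTS model.
Helps produce fairly high-quality speech with limited training data.
Pre-trained on phoneme-level sentences from nearly 100 languages and locales.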
Abstract
We present XPhoneBERT, the first multilingual model pre-trained to learn phoneme representations for the downstream text-to-speech (TTS) task. Our XPhoneBERT has the same model architecture as BERT-base, trained using the RoBERTa pre-training approach on 330M phoneme-level sentences from nearly 100 languages and locales. Experimental results show that employing XPhoneBERT as an input phoneme encoder significantly boosts the performance of a strong neural TTS model in terms of naturalness and prosody and also helps produce fairly high-quality speech with limited training data. We publicly release our pre-trained XPhoneBERT with the hope that it would facilitate future research and downstream TTS applications for multiple languages. Our XPhoneBERT model is available at https://github.com/VinAIResearch/XPhoneBERT
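The abstract says the model is trained "using the RoBERTa pre-training approach", which centers on masked language modeling with masks re-sampled each time a sequence is seen. The sketch below illustrates that idea on a phoneme sequence. It is not the authors' code: the 15% selection rate and the 80/10/10 replacement split are the usual BERT/RoBERTa defaults and are assumed here, and the `dynamic_mask` helper, the `<mask>` token string, the example phoneme string, and the `▁` word-boundary symbol are all hypothetical.

```python
import random

MASK = "<mask>"  # assumed mask token string, for illustration only


def dynamic_mask(tokens, vocab, rng, mask_prob=0.15):
    """Corrupt a phoneme sequence for masked language modeling.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions the model must predict, and None everywhere else.
    Assumed defaults: 15% of positions selected; of those, 80% become
    MASK, 10% become a random vocabulary item, 10% stay unchanged.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)
        # otherwise keep the original token in place
    return corrupted, labels


if __name__ == "__main__":
    # Hypothetical phoneme sequence with an assumed word-boundary symbol.
    phonemes = "h ə l oʊ ▁ w ɜː l d".split()
    vocab = sorted(set(phonemes))
    rng = random.Random(0)
    # Calling the function once per epoch yields a fresh mask each time,
    # which is what "dynamic" masking refers to.
    for epoch in range(3):
        corrupted, _ = dynamic_mask(phonemes, vocab, rng)
        print(epoch, " ".join(corrupted))
```

Because the mask is drawn at call time rather than fixed during preprocessing, the same phoneme sentence can contribute different prediction targets across epochs.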
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and Dialogue Systems
Methods: Multi-Head Attention · Attention Is All You Need · Attention Dropout · Linear Warmup With Linear Decay · Residual Connection · Linear Layer · Layer Normalization · Softmax · Adam
