Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech
Guangyan Zhang, Kaitao Song, Xu Tan, Daxin Tan, Yuzi Yan, Yanqing Liu, Gang Wang, Wei Zhou, Tao Qin, Tan Lee, Sheng Zhao

TL;DR
This paper introduces Mixed-Phoneme BERT, a novel model that combines phoneme and sup-phoneme representations to improve TTS performance, achieving better quality and efficiency over existing models.
Contribution
It proposes a new mixed representation approach for BERT in TTS, enhancing contextual learning and model capacity for better synthesis quality.
Findings
Achieves a 0.30 CMOS gain over the FastSpeech 2 baseline
Achieves a 3x inference speedup over PnG BERT
Maintains voice quality comparable to PnG BERT
Abstract
Recently, leveraging BERT pre-training to improve the phoneme encoder in text to speech (TTS) has drawn increasing attention. However, the works apply pre-training with character-based units to enhance the TTS phoneme encoder, which is inconsistent with the TTS fine-tuning that takes phonemes as input. Pre-training only with phonemes as input can alleviate the input mismatch but lacks the ability to model rich representations and semantic information due to the limited phoneme vocabulary. In this paper, we propose Mixed-Phoneme BERT, a novel variant of the BERT model that uses mixed phoneme and sup-phoneme representations to enhance the learning capability. Specifically, we merge the adjacent phonemes into sup-phonemes and combine the phoneme sequence and the merged sup-phoneme sequence as the model input, which can enhance the model capacity to learn rich contextual representations.…
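The merge-and-combine step described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration and is not taken from the paper: the merge table `TOY_SUP_VOCAB`, the greedy longest-match rule, the helper names, and the choice to align the two sequences by repeating each sup-phoneme once per phoneme it covers.

```python
from typing import Dict, List, Tuple

# Hypothetical merge table: a run of adjacent phonemes -> one sup-phoneme token.
# A real system would learn this vocabulary from data; these entries are made up.
TOY_SUP_VOCAB: Dict[Tuple[str, ...], str] = {
    ("HH", "AH"): "HH_AH",
    ("L", "OW"): "L_OW",
}


def merge_to_sup_phonemes(
    phonemes: List[str],
    vocab: Dict[Tuple[str, ...], str],
    max_len: int = 2,
) -> Tuple[List[str], List[int]]:
    """Greedy longest-match merge (an assumed stand-in for the paper's merge rule).

    Returns the sup-phoneme tokens and how many phonemes each one covers.
    """
    sup_tokens: List[str] = []
    span_lengths: List[int] = []
    i = 0
    while i < len(phonemes):
        # Try the longest candidate first, falling back to a single phoneme.
        for n in range(min(max_len, len(phonemes) - i), 0, -1):
            chunk = tuple(phonemes[i : i + n])
            if n == 1 or chunk in vocab:
                sup_tokens.append(vocab.get(chunk, phonemes[i]))
                span_lengths.append(n)
                i += n
                break
    return sup_tokens, span_lengths


def align_mixed_input(
    phonemes: List[str], vocab: Dict[Tuple[str, ...], str]
) -> List[Tuple[str, str]]:
    """Pair every phoneme with the sup-phoneme that covers it.

    Assumption: the two sequences are combined position by position, so each
    sup-phoneme is repeated once per phoneme in its span.
    """
    sup_tokens, span_lengths = merge_to_sup_phonemes(phonemes, vocab)
    upsampled = [tok for tok, n in zip(sup_tokens, span_lengths) for _ in range(n)]
    return list(zip(phonemes, upsampled))


if __name__ == "__main__":
    for pair in align_mixed_input(["HH", "AH", "L", "OW"], TOY_SUP_VOCAB):
        print(pair)
```

In this sketch each position carries both a fine-grained phoneme and a coarser sup-phoneme, so a model could look up one embedding for each and combine them; how the paper actually combines the two is not specified in the excerpt above.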
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Attention Is All You Need · WordPiece · Weight Decay · Position-Wise Feed-Forward Layer · Dense Connections · Attention Dropout · Multi-Head Attention · Linear Warmup With Linear Decay
