Using multiple reference audios and style embedding constraints for speech synthesis
Cheng Gong, Longbiao Wang, Zhenhua Ling, Ju Zhang, Jianwu Dang

TL;DR
This paper introduces a speech synthesis approach that uses multiple reference audios and style embedding constraints to improve naturalness and style accuracy. It addresses two limitations of reference-based synthesis: the acoustic embedding must be selected manually at inference, and training uses only matched text and speech, whereas inference may pair unmatched text and speech.
Contribution
The study proposes a method combining multiple automatically selected reference audios and style embedding constraints to improve speech synthesis quality and style consistency.
Findings
Improved speech naturalness and content quality with multiple reference audios.
Outperformed baseline in style similarity preference tests.
Enhanced robustness against mismatched text and speech during inference.
Abstract
The end-to-end speech synthesis model can directly take an utterance as reference audio and generate speech from the text with prosody and speaker characteristics similar to the reference audio. However, an appropriate acoustic embedding must be manually selected during inference. Because only matched text and speech are used in the training process, using unmatched text and speech for inference can cause the model to synthesize speech with low content quality. In this study, we propose to mitigate these two problems by using multiple reference audios and style embedding constraints rather than using only the target audio. Multiple reference audios are automatically selected using the sentence similarity determined by Bidirectional Encoder Representations from Transformers (BERT). In addition, we use "target" style embedding from a pre-trained encoder as a…
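The automatic selection step described in the abstract can be illustrated with a minimal sketch. It assumes that each candidate utterance's transcript has already been mapped to a fixed-size sentence embedding (for example by a BERT model), that similarity is measured by cosine similarity, and that the top `k` candidates are kept. The excerpt does not state the similarity measure or the number of references, so both are assumptions, and the names `cosine_similarity` and `select_reference_audios` are hypothetical. The vectors in the usage lines are toy values, not real BERT outputs.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_reference_audios(
    target_embedding: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k candidate utterance IDs most similar to the target sentence.

    `candidates` maps an utterance ID to the sentence embedding of its transcript.
    Assumption: embeddings are precomputed elsewhere (e.g. by BERT); this function
    only ranks them.
    """
    scored = [
        (utt_id, cosine_similarity(target_embedding, emb))
        for utt_id, emb in candidates.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy usage with 2-dimensional stand-ins for sentence embeddings.
toy_candidates = {"utt_a": [1.0, 0.0], "utt_b": [0.0, 1.0], "utt_c": [1.0, 1.0]}
selected = select_reference_audios([1.0, 0.1], toy_candidates, k=2)
selected_ids = [utt_id for utt_id, _ in selected]
```

The audio files associated with the selected utterance IDs would then serve as the multiple references; how their style information is combined is not covered by this sketch.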
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
