Exploring synthetic data for cross-speaker style transfer in style representation based TTS
Lucas H. Ueda, Leonardo B. de M. M. Marques, Flávio O. Simões, Mário U. Neto, Fernando Runstein, Bianca Dal Bó, Paula D. P. Costa

TL;DR
This paper investigates using synthetic data generated by voice conversion models to improve cross-speaker and cross-language style transfer in text-to-speech systems, especially in low-resource settings.
Contribution
It leverages VC-generated synthetic data, together with style encoder pre-training (timbre perturbation and prototypical angular loss), to improve style transfer and accent transfer in TTS models.
Findings
VC-generated synthetic data improves the naturalness and speaker similarity of TTS in cross-speaker scenarios.
Pre-training the style encoder reduces speaker leakage.
The approach extends to cross-language accent transfer.
Abstract
Incorporating cross-speaker style transfer in text-to-speech (TTS) models is challenging due to the need to disentangle speaker and style information in audio. In low-resource expressive data scenarios, voice conversion (VC) can generate expressive speech for target speakers, which can then be used to train the TTS model. However, the quality and style transfer ability of the VC model are crucial for the overall TTS model quality. In this work, we explore the use of synthetic data generated by a VC model to assist the TTS model in cross-speaker style transfer tasks. Additionally, we employ pre-training of the style encoder using timbre perturbation and prototypical angular loss to mitigate speaker leakage. Our results show that using VC synthetic data can improve the naturalness and speaker similarity of TTS in cross-speaker scenarios. Furthermore, we extend this approach to a…
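The abstract names a prototypical angular loss for pre-training the style encoder but gives no formula. The following is a minimal sketch, assuming the formulation commonly used in metric learning for speaker verification: each class contributes one query embedding and a prototype (the mean of its remaining embeddings), and the loss is a cross-entropy over scaled cosine similarities between queries and prototypes. Treating style labels as the classes, the function names, and the values of the scale `w` and bias `b` (learnable in practice, fixed here) are all assumptions for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angular_prototypical_loss(embeddings, w=10.0, b=-5.0):
    """Sketch of an angular prototypical loss.

    embeddings[j] holds M >= 2 embedding vectors for class j
    (assumed here to be a style label). The last vector of each
    class is the query; the others are averaged into a prototype.
    w and b are fixed here but would normally be learned.
    """
    queries = [cls[-1] for cls in embeddings]
    prototypes = []
    for cls in embeddings:
        support = cls[:-1]
        dim = len(support[0])
        prototypes.append(
            [sum(vec[d] for vec in support) / len(support) for d in range(dim)]
        )

    total = 0.0
    for j, query in enumerate(queries):
        # Scaled cosine similarity of this query to every class prototype.
        logits = [w * cosine(query, proto) + b for proto in prototypes]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of the query's own class.
        total += log_norm - logits[j]
    return total / len(queries)


if __name__ == "__main__":
    # Two toy "style" classes with 2-D embeddings.
    batch = [
        [[1.0, 0.0], [1.0, 0.1], [1.0, 0.0]],
        [[0.0, 1.0], [0.1, 1.0], [0.0, 1.0]],
    ]
    print(angular_prototypical_loss(batch))
```

The loss is small when each query lies close in angle to its own class prototype and far from the others. Because only the angle matters, embedding magnitude does not influence the result.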
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
