Exploring synthetic data for cross-speaker style transfer in style representation based TTS

Lucas H. Ueda; Leonardo B. de M. M. Marques; Flávio O. Simões; Mário U. Neto; Fernando Runstein; Bianca Dal Bó; Paula D. P. Costa

arXiv:2409.17364·eess.AS·September 27, 2024

TL;DR

This paper investigates using synthetic data generated by voice conversion models to improve cross-speaker and cross-language style transfer in text-to-speech systems, especially in low-resource settings.

Contribution

The paper uses VC-generated synthetic data, together with pre-training of the style encoder (timbre perturbation and prototypical angular loss), to improve cross-speaker style transfer in TTS models, and extends the approach to accent transfer.

Findings

01. VC-generated synthetic data can improve the naturalness and speaker similarity of TTS in cross-speaker scenarios.

02. Pre-training the style encoder reduces speaker leakage.

03. The method extends to cross-language accent transfer.

Abstract

Incorporating cross-speaker style transfer in text-to-speech (TTS) models is challenging due to the need to disentangle speaker and style information in audio. In low-resource expressive data scenarios, voice conversion (VC) can generate expressive speech for target speakers, which can then be used to train the TTS model. However, the quality and style transfer ability of the VC model are crucial for the overall TTS model quality. In this work, we explore the use of synthetic data generated by a VC model to assist the TTS model in cross-speaker style transfer tasks. Additionally, we employ pre-training of the style encoder using timbre perturbation and prototypical angular loss to mitigate speaker leakage. Our results show that using VC synthetic data can improve the naturalness and speaker similarity of TTS in cross-speaker scenarios. Furthermore, we extend this approach to a…
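
The abstract names a prototypical angular loss for style-encoder pre-training but does not give its formulation. The following is a minimal sketch, assuming the commonly used angular prototypical formulation from metric learning: for each class, the last embedding serves as the query, the mean of the remaining embeddings serves as the prototype, and a softmax cross-entropy is taken over scaled cosine similarities. The function name, the array layout, and the fixed scale `w` and bias `b` (normally learnable) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def angular_prototypical_loss(emb, w=10.0, b=-5.0):
    """Angular prototypical loss over a batch of style embeddings.

    emb: array of shape (n_styles, n_utts, dim) with n_utts >= 2.
    w, b: scale and bias on the cosine similarity (assumed fixed here;
          they are usually learnable parameters).
    """
    # Last utterance of each style is the query; the rest form the support set.
    query = emb[:, -1, :]
    proto = emb[:, :-1, :].mean(axis=1)

    # L2-normalise so that the dot product is a cosine similarity.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    p = proto / np.linalg.norm(proto, axis=1, keepdims=True)

    # logits[j, k] = w * cos(query_j, prototype_k) + b
    logits = w * (q @ p.T) + b

    # Numerically stable log-softmax over prototypes.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Each query should match the prototype of its own style (the diagonal).
    return float(-np.mean(np.diag(log_prob)))
```

Under these assumptions, the loss is small when embeddings of the same style cluster together and embeddings of different styles point in different directions, regardless of which speaker produced them. That is the property a style encoder needs in order to limit speaker leakage.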

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing