Cross-Lingual Text-to-Speech Using Multi-Task Learning and Speaker Classifier Joint Training
J. Yang, Lei He

TL;DR
This paper proposes a multi-task learning framework with speaker classifier joint training to improve cross-lingual speaker similarity in text-to-speech synthesis, for both seen and unseen speakers.
Contribution
It combines multi-task learning with speaker classifier joint training on a multilingual transformer TTS model, and uses a scheme similar to parallel scheduled sampling so that joint training does not break the transformer's parallel training.
Findings
Improved cross-lingual speaker similarity in subjective evaluations.
Enhanced objective metrics for speaker similarity.
Effective for both seen and unseen speakers.
Abstract
In cross-lingual speech synthesis, the speech in various languages can be synthesized for a monoglot speaker. Normally, only the data of monoglot speakers are available for model training, thus the speaker similarity is relatively low between the synthesized cross-lingual speech and the native language recordings. Based on the multilingual transformer text-to-speech model, this paper studies a multi-task learning framework to improve the cross-lingual speaker similarity. To further improve the speaker similarity, joint training with a speaker classifier is proposed. Here, a scheme similar to parallel scheduled sampling is proposed to train the transformer model efficiently to avoid breaking the parallel training mechanism when introducing joint training. By using multi-task learning and speaker classifier joint training, in subjective and objective evaluations, the cross-lingual speaker…
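The two training ideas named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the loss form, the mixing schedule, or the classifier input, so everything below is an assumption. The sketch assumes a two-pass scheme in the spirit of parallel scheduled sampling (a teacher-forced first pass, then a second pass over a per-step mix of gold and first-pass frames) and a joint loss that adds a weighted speaker cross-entropy to the reconstruction loss. The names mix_decoder_inputs, speaker_cross_entropy, joint_loss, toy_decoder, and classifier_weight are hypothetical.

```python
import math
import random


def mix_decoder_inputs(gold_frames, predicted_frames, sample_prob, rng):
    """Per time step, keep the gold frame or swap in the first-pass prediction.

    Both sequences are already complete, so the mix is built for every step
    at once and the second pass can still run in parallel.
    """
    if len(gold_frames) != len(predicted_frames):
        raise ValueError("gold and predicted sequences must have equal length")
    return [
        pred if rng.random() < sample_prob else gold
        for gold, pred in zip(gold_frames, predicted_frames)
    ]


def speaker_cross_entropy(logits, speaker_id):
    """Softmax cross-entropy for one utterance, using a stable log-sum-exp."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[speaker_id]


def joint_loss(reconstruction_loss, speaker_logits, speaker_id, classifier_weight):
    """Assumed joint objective: reconstruction loss plus weighted speaker loss."""
    return reconstruction_loss + classifier_weight * speaker_cross_entropy(
        speaker_logits, speaker_id
    )


def toy_decoder(frames):
    """Hypothetical stand-in for the TTS decoder: adds a fixed offset per frame."""
    return [x + 0.1 for x in frames]


# Pass 1: teacher-forced on gold frames, all steps computed together.
gold = [0.0, 1.0, 2.0, 3.0]
first_pass = toy_decoder(gold)

# Pass 2: decode again from a mix of gold frames and first-pass predictions.
mixed = mix_decoder_inputs(gold, first_pass, sample_prob=0.5, rng=random.Random(0))
second_pass = toy_decoder(mixed)

# Assumed: the speaker classifier scores the second-pass output. The logits
# here are placeholder values for a two-speaker setup.
loss = joint_loss(
    reconstruction_loss=0.25,
    speaker_logits=[2.0, 0.5],
    speaker_id=0,
    classifier_weight=0.1,
)
```

With sample_prob set to 0 the second pass reduces to ordinary teacher forcing, and with sample_prob set to 1 it conditions only on the model's own first-pass output, which is why a single knob is enough to move between the two regimes in this sketch.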
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
