Improving Cross-lingual Speech Synthesis with Triplet Training Scheme
Jianhao Ye, Hongbin Zhou, Zhiba Su, Wendi He, Kaimeng Ren, Lin Li, Heng Lu

TL;DR
This paper introduces a triplet training scheme for cross-lingual TTS that improves pronunciation naturalness and intelligibility by fine-tuning the model with triplet loss, effectively making synthesized speech sound more native.
Contribution
The paper proposes a novel triplet training scheme with an extra fine-tuning stage to enhance cross-lingual speech synthesis quality, addressing pronunciation gaps in existing systems.
Findings
Significant improvement in speech naturalness and intelligibility.
Effective adaptation to unseen content and speaker combinations.
Enhanced cross-lingual TTS performance demonstrated through evaluations.
Abstract
Recent advances in cross-lingual text-to-speech (TTS) have made it possible to synthesize speech in a language foreign to a monolingual speaker. However, there is still a large gap between the pronunciation of generated cross-lingual speech and that of native speakers in terms of naturalness and intelligibility. In this paper, a triplet training scheme is proposed to enhance cross-lingual pronunciation by allowing previously unseen content and speaker combinations to be seen during training. The proposed method introduces an extra fine-tuning stage with triplet loss during training, which efficiently draws the pronunciation of the synthesized foreign speech closer to that of the native anchor speaker while preserving the non-native speaker's timbre. Experiments are conducted based on a state-of-the-art baseline cross-lingual TTS system and its enhanced variants. All the objective and…
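The abstract names triplet loss but the excerpt does not give its exact form, so the following is only a minimal sketch of the standard triplet margin loss, max(0, d(a, p) - d(a, n) + margin), not the authors' implementation. The function names, the Euclidean distance, the margin value, and the toy 2-D vectors are all assumptions made for illustration. How the paper assigns anchor, positive, and negative roles to native and synthesized speech is not specified here beyond the native speaker acting as the anchor.

```python
import math


def euclidean(u, v):
    """L2 distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`; otherwise it grows linearly.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D feature vectors, chosen only so the distances are easy to check.
native_anchor = [0.0, 0.0]
close_sample = [3.0, 4.0]   # distance 5 from the anchor
far_sample = [6.0, 8.0]     # distance 10 from the anchor

# Positive already much closer than negative: no penalty.
satisfied = triplet_loss(native_anchor, close_sample, far_sample)

# Roles swapped: the "positive" is farther than the "negative", so the loss is positive.
violated = triplet_loss(native_anchor, far_sample, close_sample)
```

In a fine-tuning stage, a term of this shape would be minimized alongside the usual synthesis objective, pushing the sample treated as positive toward the anchor and the sample treated as negative away from it.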
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research
Methods: Triplet Loss
