Improving Cross-lingual Speech Synthesis with Triplet Training Scheme

Jianhao Ye; Hongbin Zhou; Zhiba Su; Wendi He; Kaimeng Ren; Lin Li; Heng Lu

arXiv:2202.10729·cs.SD·February 23, 2022

Open Access

TL;DR

This paper introduces a triplet training scheme for cross-lingual TTS that improves pronunciation naturalness and intelligibility by fine-tuning the model with triplet loss, effectively making synthesized speech sound more native.

Contribution

The paper proposes a novel triplet training scheme with an extra fine-tuning stage to enhance cross-lingual speech synthesis quality, addressing pronunciation gaps in existing systems.

Findings

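01. Significant improvement in the naturalness and intelligibility of cross-lingual synthesized speech.

02. Effective adaptation to content and speaker combinations not seen in the original training data.

03. Enhanced cross-lingual TTS performance, demonstrated in evaluations based on a state-of-the-art baseline system and its enhanced variants.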

Abstract

Recent advances in cross-lingual text-to-speech (TTS) have made it possible to synthesize speech in a language foreign to a monolingual speaker. However, there is still a large gap between the pronunciation of generated cross-lingual speech and that of native speakers in terms of naturalness and intelligibility. In this paper, a triplet training scheme is proposed to enhance cross-lingual pronunciation by allowing previously unseen content and speaker combinations to be seen during training. The proposed method introduces an extra fine-tuning stage with triplet loss during training, which efficiently draws the pronunciation of the synthesized foreign speech closer to that of the native anchor speaker, while preserving the non-native speaker's timbre. Experiments are conducted based on a state-of-the-art baseline cross-lingual TTS system and its enhanced variants. All the objective and…
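
As a rough illustration of the triplet loss mentioned in the abstract, the following Python sketch implements the standard margin-based triplet loss on plain feature vectors. The Euclidean distance, the margin value, the toy vectors, and the function names `euclidean` and `triplet_loss` are all assumptions made for illustration; the abstract does not specify the paper's exact loss, its features, or how its triplets are constructed.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`; otherwise it grows with the violation.
    NOTE: the distance and margin are illustrative assumptions, not the
    paper's confirmed formulation.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical toy features (not data from the paper).
anchor = [0.0, 0.0]
far_sample = [3.0, 4.0]
near_sample = [0.0, 1.0]

# Positive is far from the anchor and negative is near: the loss is positive.
violated = triplet_loss(anchor, far_sample, near_sample, margin=1.0)

# Positive is near and negative is far by more than the margin: the loss is zero.
satisfied = triplet_loss(anchor, near_sample, far_sample, margin=1.0)
```

In a fine-tuning setting, a loss of this shape would be minimized so that the sample treated as the positive moves toward the anchor and away from the negative. Which signals play the anchor, positive, and negative roles in the paper is not stated in the truncated abstract.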

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research

Methods: Triplet Loss