Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding

Wei-Ping Huang; Po-Chun Chen; Sung-Feng Huang; Hung-yi Lee

arXiv:2206.15427 · eess.AS · August 4, 2022

TL;DR

This paper introduces a transferable phoneme embedding framework for cross-lingual TTS that synthesizes intelligible speech in unseen languages from very limited data (as few as 4 utterances, about 30 seconds), combining a phoneme-based TTS model, a codebook module, and phoneme-level self-supervised features.

Contribution

The paper proposes a novel framework combining a phoneme-based TTS model and a codebook module for effective cross-lingual transfer in few-shot settings.

Findings

01 Achieves intelligible speech synthesis in unseen languages with only 4 utterances (about 30 seconds of data).

02 Phoneme-level averaged self-supervised features improve the quality of the synthesized speech.

03 Naive transfer learning fails to adapt to unseen languages under extremely few-shot conditions (less than 8 minutes of data), motivating the new approach.

Abstract

This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech (TTS) problem under the few-shot setting. Transfer learning is a common approach to few-shot learning, since training from scratch on few-shot training data is bound to overfit. Still, we find that the naive transfer learning approach fails to adapt to unseen languages under extremely few-shot settings, where less than 8 minutes of data is provided. We deal with the problem by proposing a framework that consists of a phoneme-based TTS model and a codebook module to project phonemes from different languages into a learned latent space. Furthermore, by utilizing phoneme-level averaged self-supervised learned features, we effectively improve the quality of synthesized speech. Experiments show that using 4 utterances, which is about 30 seconds of data, is…
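The two ideas named in the abstract, phoneme-level averaging of self-supervised features and a codebook that maps phonemes into a shared latent space, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how the codebook is queried, so the softmax attention below is an assumption, and the function names `average_by_phoneme` and `codebook_embedding`, the toy 2-dimensional features, and the 2-entry codebook are all hypothetical.

```python
import math
from collections import defaultdict


def average_by_phoneme(frames, labels):
    """Average the frame-level feature vectors that share a phoneme label."""
    sums = {}
    counts = defaultdict(int)
    for vec, ph in zip(frames, labels):
        if ph not in sums:
            sums[ph] = [0.0] * len(vec)
        sums[ph] = [s + v for s, v in zip(sums[ph], vec)]
        counts[ph] += 1
    return {ph: [s / counts[ph] for s in total] for ph, total in sums.items()}


def codebook_embedding(query, codebook):
    """Assumed mechanism: softmax attention over codebook entries.

    Returns the attention-weighted sum of the codebook vectors, so every
    phoneme embedding lies in the space spanned by the shared codebook.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * c for q, c in zip(query, code)) / scale for code in codebook]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(codebook[0])
    return [sum(w * code[i] for w, code in zip(weights, codebook)) for i in range(dim)]


# Toy stand-ins for frame-level self-supervised features and their phoneme labels.
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = ["a", "a", "b"]
# Toy stand-in for a learned codebook (in a real model these are trained parameters).
codebook = [[1.0, 0.0], [0.0, 1.0]]

phoneme_features = average_by_phoneme(frames, labels)
phoneme_embeddings = {
    ph: codebook_embedding(feat, codebook) for ph, feat in phoneme_features.items()
}
```

In this sketch a new language only needs a few labeled frames per phoneme to obtain an embedding, because the codebook itself is shared across languages; that is one way to read why such a design could suit the few-shot setting described above.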

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Topic Modeling