Latent Filling: Latent Space Data Augmentation for Zero-shot Speech Synthesis

Jae-Sung Bae; Joun Yeop Lee; Ji-Hyun Lee; Seongkyu Mun; Taehwa Kang; Hoon-Young Cho; Chanwoo Kim

arXiv:2310.03538·eess.AS·January 23, 2024·ICASSP

TL;DR

This paper introduces a latent space data augmentation method called Latent Filling (LF) for zero-shot speech synthesis, which improves speaker similarity without degrading speech quality by augmenting in the speaker embedding space.

Contribution

The paper proposes a novel latent space data augmentation technique for ZS-TTS that enhances speaker similarity without additional training stages.

Findings

01. LF significantly improves speaker similarity.
02. LF preserves speech quality.
03. LF integrates seamlessly into existing ZS-TTS systems.

Abstract

Previous works in zero-shot text-to-speech (ZS-TTS) have attempted to enhance its systems by enlarging the training data through crowd-sourcing or augmenting existing speech data. However, the use of low-quality data has led to a decline in the overall system performance. To avoid such degradation, instead of directly augmenting the input data, we propose a latent filling (LF) method that adopts simple but effective latent space data augmentation in the speaker embedding space of the ZS-TTS system. By incorporating a consistency loss, LF can be seamlessly integrated into existing ZS-TTS systems without the need for additional training stages. Experimental results show that LF significantly improves speaker similarity while preserving speech quality.
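
The abstract does not spell out which augmentation operations LF applies or what form the consistency loss takes. As a rough illustration only, the sketch below assumes two common latent-space augmentations (linear interpolation between speaker embeddings and additive Gaussian noise) and a cosine-distance consistency term between the augmented embedding and the embedding re-extracted from synthesized speech. All function names, the probability `p_aug`, and the noise scale `sigma` are hypothetical and are not taken from the paper.

```python
import math
import random


def interpolate(emb_a, emb_b, alpha):
    """Linearly mix two speaker embeddings (assumed augmentation)."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(emb_a, emb_b)]


def add_noise(emb, sigma, rng):
    """Perturb an embedding with zero-mean Gaussian noise (assumed augmentation)."""
    return [x + rng.gauss(0.0, sigma) for x in emb]


def latent_fill(batch, rng, p_aug=0.5, sigma=0.01):
    """Return a copy of the batch in which some embeddings are augmented.

    p_aug and sigma are hypothetical hyperparameters chosen for illustration.
    """
    filled = []
    for emb in batch:
        if rng.random() < p_aug:
            partner = rng.choice(batch)   # another speaker from the same batch
            alpha = rng.random()          # random mixing weight in [0, 1)
            emb = add_noise(interpolate(emb, partner, alpha), sigma, rng)
        filled.append(list(emb))
    return filled


def cosine_consistency_loss(target_emb, predicted_emb):
    """1 - cosine similarity; assumes both embeddings have non-zero norm.

    target_emb:    the (possibly augmented) embedding fed to the TTS model
    predicted_emb: the embedding re-extracted from the synthesized speech
    """
    dot = sum(t * p for t, p in zip(target_emb, predicted_emb))
    norm_t = math.sqrt(sum(t * t for t in target_emb))
    norm_p = math.sqrt(sum(p * p for p in predicted_emb))
    return 1.0 - dot / (norm_t * norm_p)


if __name__ == "__main__":
    rng = random.Random(0)
    batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    augmented = latent_fill(batch, rng)
    # In a real system, predicted_emb would come from a speaker encoder
    # applied to the model's output; here the target is reused as a stand-in.
    print(cosine_consistency_loss(augmented[0], augmented[0]))
```

An augmented embedding has no matching ground-truth audio, so a reconstruction loss cannot be computed for it. Under the assumptions above, a consistency term of this kind would still give such samples a training signal, which fits the abstract's claim that no additional training stage is needed.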

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing