Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

Dongchan Min; Dong Bok Lee; Eunho Yang; Sung Ju Hwang

arXiv:2106.03153 · eess.AS · June 17, 2021 · 45 cites

TL;DR

Meta-StyleSpeech is a novel multi-speaker TTS model that adapts to new speakers using style normalization and meta-learning, enabling high-quality speech synthesis from minimal reference audio.

Contribution

It introduces Style-Adaptive Layer Normalization and episodic training with discriminators, advancing speaker adaptation in TTS without fine-tuning.

Findings

01 High-quality speech synthesis from 1-3 sec audio clips

02 Significant improvement over baseline methods

03 Effective speaker style transfer with minimal data

Abstract

With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech from only a few short audio samples of the given speaker. However, existing methods either require fine-tuning the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN), which aligns the gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio sample. Furthermore, to enhance StyleSpeech's…
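As a rough illustration of the SALN idea described in the abstract, the sketch below normalizes each frame of a hidden text representation and then applies a gain and bias predicted from a style vector. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `saln`, the single affine projections `W_gain`/`W_bias`, and all dimensions are hypothetical choices made for this example.

```python
import numpy as np

def saln(h, w, W_gain, b_gain, W_bias, b_bias, eps=1e-5):
    """Style-adaptive layer normalization (illustrative sketch).

    h: (T, H) hidden features for T text positions (hypothetical shapes).
    w: (S,) style vector extracted from reference audio.
    W_gain, W_bias: (S, H) projections; b_gain, b_bias: (H,) offsets.
    """
    # Standard layer normalization over the feature axis.
    mu = h.mean(axis=-1, keepdims=True)
    sigma = h.std(axis=-1, keepdims=True)
    y = (h - mu) / (sigma + eps)
    # Gain and bias are predicted from the style vector instead of
    # being fixed learned parameters (assumed here to be affine maps).
    gain = w @ W_gain + b_gain
    bias = w @ W_bias + b_bias
    return gain * y + bias

rng = np.random.default_rng(0)
T, H, S = 4, 8, 3  # arbitrary sizes for the demo
h = rng.normal(size=(T, H))
w = rng.normal(size=S)
W_gain, W_bias = rng.normal(size=(S, H)), rng.normal(size=(S, H))
b_gain, b_bias = np.ones(H), np.zeros(H)

out = saln(h, w, W_gain, b_gain, W_bias, b_bias)
print(out.shape)
```

With the projections set to zero, the gain falls back to `b_gain` and the bias to `b_bias`, so the function reduces to plain layer normalization; the style vector only changes how the normalized features are rescaled and shifted.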

Code & Models

Videos

Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation · slideslive

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling

Methods: Layer Normalization