Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation
Dongchan Min, Dong Bok Lee, Eunho Yang, Sung Ju Hwang

TL;DR
Meta-StyleSpeech is a novel multi-speaker TTS model that adapts to new speakers using style normalization and meta-learning, enabling high-quality speech synthesis from minimal reference audio.
Contribution
It introduces Style-Adaptive Layer Normalization and episodic training with discriminators, advancing speaker adaptation in TTS without fine-tuning.
Findings
High-quality speech synthesis from reference audio clips of 1-3 seconds
Significant improvement over baseline methods
Effective speaker style transfer with minimal data
Abstract
With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech from only a few short audio samples of the given speaker. However, existing methods either require fine-tuning the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN), which aligns the gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio. Furthermore, to enhance StyleSpeech's…
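The idea of predicting a gain and bias from a style vector and applying them to normalized text features can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`layer_norm`, `affine`, `saln`), the use of a single linear projection per parameter, and the toy dimensions are all assumptions made for illustration, and the code uses plain Python lists rather than a deep learning framework.

```python
import math

def layer_norm(h, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]

def affine(weights, bias, w):
    """Hypothetical linear projection of the style vector w (one row per output)."""
    return [sum(wi * x for wi, x in zip(row, w)) + b
            for row, b in zip(weights, bias)]

def saln(h, style, gain_proj, bias_proj):
    """Sketch of style-adaptive layer normalization.

    h         : text-side feature vector (list of floats)
    style     : style vector extracted from reference audio (assumed given)
    gain_proj : (weights, bias) pair mapping style -> per-feature gain
    bias_proj : (weights, bias) pair mapping style -> per-feature bias
    """
    y = layer_norm(h)                  # style-independent normalization
    gain = affine(*gain_proj, style)   # gain depends on the style vector
    bias = affine(*bias_proj, style)   # bias depends on the style vector
    return [g * yi + b for g, yi, b in zip(gain, y, bias)]

# Toy usage: 4 text features, 2-dimensional style vector.
h = [1.0, 2.0, 3.0, 4.0]
style = [0.5, -0.5]
gain_proj = ([[1.0, 0.0]] * 4, [1.0] * 4)   # gain = 1 + style[0]
bias_proj = ([[0.0, 1.0]] * 4, [0.0] * 4)   # bias = style[1]
out = saln(h, style, gain_proj, bias_proj)
```

Unlike standard layer normalization, where gain and bias are fixed learned parameters, here they change with the style vector, so the same text features are rescaled and shifted differently for each reference speaker.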
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling
Methods: Layer Normalization
