StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani

TL;DR
StyleTTS 2 is a TTS model that combines style diffusion with adversarial training against large speech language models (SLMs). It reaches human-level naturalness on LJSpeech and VCTK and improves zero-shot speaker adaptation.
Contribution
It is the first to combine style diffusion and adversarial training with large speech language models (SLMs) for human-level TTS.
Findings
Surpasses human recordings on the single-speaker LJSpeech dataset, as judged by native English speakers.
Matches human recordings on the multispeaker VCTK dataset.
Outperforms previous models in zero-shot speaker adaptation.
Abstract
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on…
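The abstract mentions differentiable duration modeling but this summary does not give its formulation. As a rough illustration of the general idea only, the sketch below uses Gaussian upsampling, a different and widely used differentiable alignment technique, as a stand-in; it is not the method from the paper. The function names `gaussian_upsample` and `upsample`, the `sigma` parameter, and the toy durations are all assumptions made for this example.

```python
import math


def gaussian_upsample(durations, num_frames, sigma=1.0):
    """Return a soft alignment matrix of shape [num_frames][num_tokens].

    Illustrative stand-in (Gaussian upsampling), NOT the StyleTTS 2 method.
    Each row is a softmax over tokens, so every entry varies smoothly with
    the real-valued durations instead of depending on hard rounding.
    """
    # Centre of each token on the frame axis: cumulative end minus half its duration.
    centers = []
    end = 0.0
    for d in durations:
        end += d
        centers.append(end - d / 2.0)

    alignment = []
    for t in range(num_frames):
        frame_pos = t + 0.5  # midpoint of frame t
        logits = [-((frame_pos - c) ** 2) / (2.0 * sigma ** 2) for c in centers]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(v - peak) for v in logits]
        total = sum(weights)
        alignment.append([w / total for w in weights])
    return alignment


def upsample(token_features, alignment):
    """Mix token feature vectors into frame feature vectors via the soft alignment."""
    dim = len(token_features[0])
    return [
        [sum(row[i] * token_features[i][k] for i in range(len(row))) for k in range(dim)]
        for row in alignment
    ]


if __name__ == "__main__":
    # Hypothetical predicted durations (in frames) for three tokens.
    durations = [2.0, 3.5, 1.5]
    align = gaussian_upsample(durations, num_frames=7)
    frames = upsample([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], align)
    print(len(frames), len(frames[0]))
```

Because each frame is a smooth mixture of token features, a loss computed on the output (for example, one from a discriminator) can in principle send gradients back to the predicted durations, which is what makes end-to-end training possible in schemes of this kind.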
Code & Models
- 🤗 hexgrad/Kokoro-82M (model) · 9.6M downloads · ♡ 5866
- 🤗 ShoukanLabs/Vokan (model) · ♡ 76
- 🤗 mrfakename/styletts2-detector (model) · 5 downloads · ♡ 3
- 🤗 h3110Fr13nd/styletts2-spanish-maxlen-100-epoch-4 (model)
- 🤗 geneing/Kokoro (model) · 7 downloads · ♡ 16
- 🤗 MaziyarPanahi/Kokoro-82M (model) · 1 download · ♡ 5
- 🤗 AliceJohnson/Darwin-AI (model) · 2 downloads
- 🤗 ctranslate2-4you/Kokoro-82M-light (model) · 9 downloads · ♡ 9
- 🤗 prince-canuma/Kokoro-82M (model) · 723 downloads · ♡ 5
- 🤗 prince-canuma/Kokoro-82M-4bit (model) · 15 downloads
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Diffusion
