VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design

Jungil Kong; Jihoon Park; Beomjeong Kim; Jeongmin Kim; Dohee Kong; Sangjin Kim

arXiv:2307.16430·cs.SD·August 1, 2023

TL;DR

VITS2 is a single-stage text-to-speech model that improves speech naturalness and efficiency and reduces dependence on phoneme conversion through improved architecture and training strategies.

Contribution

The paper introduces VITS2, a single-stage TTS model that improves on the previous single-stage model in naturalness, multi-speaker similarity, and training and inference efficiency, while reducing dependence on phoneme conversion.

Findings

01 Enhanced speech naturalness and similarity in multi-speaker settings.

02 Reduced training and inference time.

03 Significantly decreased reliance on phoneme conversion.

Abstract

Single-stage text-to-speech models have been actively studied recently, and their results have outperformed two-stage pipeline systems. Although the previous single-stage model has made great progress, there is room for improvement in terms of its intermittent unnaturalness, computational efficiency, and strong dependence on phoneme conversion. In this work, we introduce VITS2, a single-stage text-to-speech model that efficiently synthesizes more natural speech by improving several aspects of the previous work. We propose improved structures and training mechanisms and show that the proposed methods are effective in improving naturalness, similarity of speech characteristics in a multi-speaker model, and efficiency of training and inference. Furthermore, we demonstrate that the strong dependence on phoneme conversion in previous works can be significantly reduced with our method,…

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems