SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

Dongchao Yang; Rongjie Huang; Yuanyuan Wang; Haohan Guo; Dading Chong; Songxiang Liu; Xixin Wu; Helen Meng

arXiv:2408.13893 · cs.SD · August 29, 2024

Open Access

TL;DR

SimpleSpeech 2 introduces a simple, efficient non-autoregressive text-to-speech model that combines the strengths of AR and NAR methods, achieving high-quality, fast, and stable speech synthesis with easier data preparation.

Contribution

The paper presents a novel flow-based scalar latent transformer diffusion model and a detailed analysis of speech tokenization and duration predictors, advancing large-scale TTS performance and efficiency.

Findings

1. Significant improvement in speech quality and speed over previous models
2. Effective extension to multilingual TTS datasets
3. Stable high-quality speech synthesis with simplified data and model design

Abstract

Scaling text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective method for improving the diversity and naturalness of synthesized speech. At a high level, previous large-scale TTS models can be categorized as either autoregressive (AR) models (e.g., VALL-E) or non-autoregressive (NAR) models (e.g., NaturalSpeech 2/3). Although these works demonstrate good performance, they still have potential weaknesses. For instance, AR-based models are plagued by unstable generation quality and slow generation speed; meanwhile, some NAR-based models need phoneme-level duration alignment information, thereby increasing the complexity of data pre-processing, model design, and loss design. In this work, we build upon our previous publication by implementing a simple and efficient NAR TTS framework, termed SimpleSpeech 2.…

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling

Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings