SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models
Dongchao Yang, Rongjie Huang, Yuanyuan Wang, Haohan Guo, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng

TL;DR
SimpleSpeech 2 introduces a simple, efficient non-autoregressive text-to-speech model that combines the strengths of AR and NAR methods, achieving high-quality, fast, and stable speech synthesis with easier data preparation.
Contribution
The paper presents a novel flow-based scalar latent transformer diffusion model and detailed analysis of speech tokenization and duration predictors, advancing large-scale TTS performance and efficiency.
Findings
Significant improvement in speech quality and speed over previous models
Effective extension to multilingual TTS datasets
Stable high-quality speech synthesis with simplified data and model design
Abstract
Scaling text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective method for improving the diversity and naturalness of synthesized speech. At a high level, previous large-scale TTS models can be categorized as either auto-regressive (AR) based (e.g., VALL-E) or non-auto-regressive (NAR) based (e.g., NaturalSpeech 2/3). Although these works demonstrate good performance, they still have potential weaknesses. For instance, AR-based models are plagued by unstable generation quality and slow generation speed; meanwhile, some NAR-based models need phoneme-level duration alignment information, thereby increasing the complexity of data pre-processing, model design, and loss design. In this work, we build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.…
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
