High-Quality Text-to-Speech Implementation via Active Shallow Diffusion Mechanism

Junlin Deng; Ruihan Hou; Yan Deng; Yongqiu Long; Ning Wu

PMC · DOI:10.3390/s25030833 · January 30, 2025

Open Access

TL;DR

This paper introduces CMG-TTS (Cascaded MixGAN-TTS), a two-stage, diffusion-based text-to-speech acoustic model designed to generate high-quality speech with fast inference.

Contribution

The main contribution is a cascaded acoustic model whose training is split into two stages by an active shallow diffusion mechanism, aimed at fast and efficient text-to-speech synthesis.

Findings

01 The CMG-TTS model achieves high-quality speech with only one denoising step.

02 The model outperforms others in real-time performance metrics.

03 Both stages of the model are effective, as shown in ablation studies.

Abstract

Denoising diffusion probabilistic models (DDPMs) have proven to be useful in text-to-speech (TTS) tasks; however, it has been a challenge for traditional diffusion models to carry out real-time processing because of the need for hundreds of sampling steps during the iteration. In this work, a two-stage fast inference and efficient diffusion-based acoustic model of TTS, the Cascaded MixGAN-TTS (CMG-TTS), is proposed to address this problem. An active shallow diffusion mechanism is adopted to divide the CMG-TTS training process into two stages. Specifically, a basic acoustic model in the first stage is trained to provide valuable a priori knowledge for the second stage, and for the underlying acoustic modeling, a mixture combination mechanism-based linguistic encoder is introduced to work with pitch and energy predictors. In the following stage of processing, a post-net is used to…
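The two-stage idea in the abstract can be illustrated with a minimal sketch of generic shallow diffusion. It assumes the standard DDPM forward process: a coarse first-stage output is noised up to a shallow step k, then refined in a single denoising step. This is not the paper's implementation. The function names, the linear noise schedule, the step index, and the toy denoiser (a stand-in for a trained network) are all assumptions made for illustration, since the excerpt does not specify them.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise schedule (assumed; the excerpt gives no schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bar(betas, k):
    """Cumulative product of (1 - beta_t) over the first k steps; equals 1.0 for k = 0."""
    prod = 1.0
    for beta in betas[:k]:
        prod *= 1.0 - beta
    return prod


def diffuse_to_step(x0, betas, k, rng):
    """DDPM forward process: x_k = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise."""
    a_bar = alpha_bar(betas, k)
    signal_scale = math.sqrt(a_bar)
    noise_scale = math.sqrt(1.0 - a_bar)
    return [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in x0]


def toy_denoiser(x_k, coarse, betas, k):
    """Hypothetical stand-in for a trained network that predicts the clean signal.

    It averages the rescaled noisy input with the coarse prior; a real model
    would be learned from data.
    """
    a_bar = alpha_bar(betas, k)
    return [0.5 * (xk / math.sqrt(a_bar)) + 0.5 * c for xk, c in zip(x_k, coarse)]


def shallow_diffusion_inference(coarse, betas, k, rng):
    """Noise the stage-one output to shallow step k, then denoise once."""
    x_k = diffuse_to_step(coarse, betas, k, rng)
    return toy_denoiser(x_k, coarse, betas, k)


if __name__ == "__main__":
    rng = random.Random(0)
    betas = linear_beta_schedule(100)
    coarse_frame = [0.1, -0.2, 0.3]  # made-up values standing in for a mel frame
    print(shallow_diffusion_inference(coarse_frame, betas, 20, rng))
```

The point of the sketch is the control flow: because sampling starts from a noised version of the first-stage prediction rather than from pure noise, only a short reverse trajectory is needed, which is consistent with the single denoising step reported in the findings.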

Figures: 6

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing