# High-Quality Text-to-Speech Implementation via Active Shallow Diffusion Mechanism

**Authors:** Junlin Deng, Ruihan Hou, Yan Deng, Yongqiu Long, Ning Wu

PMC · DOI: 10.3390/s25030833 · 2025-01-30

## TL;DR

This paper introduces CMG-TTS (Cascaded MixGAN-TTS), a two-stage, diffusion-based acoustic model for text-to-speech that reaches satisfactory speech quality with a single denoising step.

## Contribution

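The main contribution is a cascaded acoustic model trained in two stages through an active shallow diffusion mechanism: a basic acoustic model first supplies prior knowledge, and a post-net then refines the mel-spectrogram reconstruction, which enables fast inference.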

## Key findings

- CMG-TTS achieves satisfactory subjective and objective evaluation scores on AISHELL3 and LJSpeech with only one denoising step.
- Among the diffusion-based TTS models compared, CMG-TTS obtains the leading real-time factor (RTF).
- Ablation studies show that both stages of CMG-TTS are effective.
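
Real-time performance in TTS is commonly reported as the real-time factor (RTF): the time spent synthesizing divided by the duration of the audio produced, so lower is better and values below 1 mean faster than real time. The sketch below illustrates that ratio only. The helper names `real_time_factor` and `measure_rtf` and the `synthesize` callable are hypothetical, and the paper's own measurement setup is not reproduced here.

```python
import time


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize, text: str, sample_rate: int) -> float:
    """Time one call to a synthesizer and convert it to an RTF.

    `synthesize` is a hypothetical callable that maps text to a sequence
    of audio samples; it stands in for any TTS system.
    """
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(waveform) / sample_rate)
```

For example, a system that needs 0.5 s to produce 10 s of audio has an RTF of 0.05.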

## Abstract

Denoising diffusion probabilistic models (DDPMs) have proven useful in text-to-speech (TTS) tasks; however, traditional diffusion models struggle to run in real time because iterative sampling requires hundreds of steps. To address this problem, this work proposes the Cascaded MixGAN-TTS (CMG-TTS), a two-stage, diffusion-based TTS acoustic model designed for fast and efficient inference. An active shallow diffusion mechanism divides CMG-TTS training into two stages. In the first stage, a basic acoustic model is trained to provide valuable a priori knowledge for the second stage; for this underlying acoustic modeling, a linguistic encoder based on a mixture combination mechanism is introduced to work with pitch and energy predictors. In the second stage, a post-net is used to optimize the mel-spectrogram reconstruction performance. CMG-TTS is evaluated on datasets including AISHELL3 and LJSpeech, and the experiments show that it achieves satisfactory results in both subjective and objective evaluation metrics with only one denoising step. Compared to other diffusion-based TTS models, CMG-TTS obtains a leading score in the real-time factor (RTF), and the ablation studies show that both stages of CMG-TTS are effective.
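
To make the idea of a single denoising step concrete, the sketch below uses the generic DDPM forward-noising identity, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, and its algebraic inversion for x_0 given a noise estimate. It is not the CMG-TTS architecture: the linear beta schedule, the shallow step index, and the function names are assumptions made for illustration, and the true noise is passed in where a trained network would supply a prediction.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def diffuse(x0, alpha_bar_t, rng):
    """Sample x_t from q(x_t | x_0) and also return the noise that was added."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    scale_signal = math.sqrt(alpha_bar_t)
    scale_noise = math.sqrt(1.0 - alpha_bar_t)
    xt = [scale_signal * v + scale_noise * n for v, n in zip(x0, noise)]
    return xt, noise


def predict_x0(xt, predicted_noise, alpha_bar_t):
    """Invert the forward identity in one step, given a noise estimate."""
    scale_signal = math.sqrt(alpha_bar_t)
    scale_noise = math.sqrt(1.0 - alpha_bar_t)
    return [(v - scale_noise * n) / scale_signal
            for v, n in zip(xt, predicted_noise)]


# Toy stand-in for a few mel-spectrogram values from a first-stage model.
coarse_mel = [0.2, -0.5, 1.0, 0.0]
alpha_bars = linear_alpha_bar(100)
shallow_step = 10  # assumed small step index: noise only lightly, not to pure noise
noisy, true_noise = diffuse(coarse_mel, alpha_bars[shallow_step], random.Random(0))
# A trained denoiser would predict the noise; the true noise is used here instead.
recovered = predict_x0(noisy, true_noise, alpha_bars[shallow_step])
```

With an exact noise estimate the inversion recovers the input up to floating-point error; in a real system the quality of the one-step result depends on how good the learned prediction and the starting point are.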

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11819741/full.md

---
Source: https://tomesphere.com/paper/PMC11819741