FNH-TTS: A Fast, Natural, and Human-Like Speech Synthesis System with advanced prosodic modeling based on Mixture of Experts

Qingliang Meng; Yuqing Deng; Wei Liang; Limei Yu; Huizhi Liang; Tian Li

arXiv:2508.12001 · eess.AS · August 21, 2025



TL;DR

FNH-TTS is a novel speech synthesis system that combines advanced prosodic modeling with a fast, non-autoregressive architecture, achieving more natural, human-like speech with improved quality and efficiency.

Contribution

The paper introduces a new Duration Predictor based on Mixture of Experts and a multi-scale discriminator-based Vocoder, integrated into the VITS framework for enhanced prosody and synthesis quality.
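To make the idea of a Mixture-of-Experts duration predictor concrete, here is a minimal sketch in plain Python. It is not the paper's architecture: the class name `MoEDurationPredictor`, the linear experts, the softmax gate, the log-duration output, the random weights, and the toy phoneme features are all assumptions chosen for illustration. It shows only the general pattern, in which a gate assigns per-phoneme weights to several experts and the prediction is the weighted sum of the expert outputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(w, x):
    """Dot product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


class MoEDurationPredictor:
    """Hypothetical toy model: linear experts mixed by a softmax gate."""

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)
        # Random (untrained) weights, one vector per expert and per gate output.
        self.gate = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
                     for _ in range(num_experts)]
        self.experts = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
                        for _ in range(num_experts)]

    def gate_weights(self, x):
        """Mixing weights over experts for one phoneme feature vector."""
        return softmax([dot(g, x) for g in self.gate])

    def predict_frames(self, x):
        """Predict a duration in frames (at least 1) for one phoneme."""
        weights = self.gate_weights(x)
        # Each expert predicts a log-duration; the gate mixes them.
        log_dur = sum(w * dot(e, x) for w, e in zip(weights, self.experts))
        return max(1, round(math.exp(log_dur)))


# Toy phoneme feature vectors (assumed; real inputs would be encoder states).
predictor = MoEDurationPredictor(dim=4, num_experts=3)
phonemes = [[0.2, -0.1, 0.7, 0.3], [1.0, 0.5, -0.4, 0.0]]
durations = [predictor.predict_frames(x) for x in phonemes]
print(durations)
```

In a trained system the gate and experts would be learned jointly, and the experts would typically be small neural networks rather than single linear maps.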

Findings

01 Outperforms existing systems in synthesis quality and speed

02 Produces more natural duration predictions aligned with human speech

03 Achieves superior results in phoneme duration and Vocoder metrics

Abstract

Achieving natural and human-like speech synthesis with low inference costs remains a major challenge in speech synthesis research. This study focuses on human prosodic patterns and synthesized spectrum harmony, addressing the challenges of prosody modeling and artifact issues in non-autoregressive models. To enhance prosody modeling and synthesis quality, we introduce a new Duration Predictor based on the Mixture of Experts alongside a new Vocoder with two advanced multi-scale discriminators. We integrated these new modules into the VITS system, forming our FNH-TTS system. Our experiments on LJSpeech, VCTK, and LibriTTS demonstrate the system's superiority in synthesis quality, phoneme duration prediction, Vocoder results, and synthesis speed. Our prosody visualization results show that FNH-TTS produces duration predictions that more closely align with natural human beings than…


Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems