FNH-TTS: A Fast, Natural, and Human-Like Speech Synthesis System with advanced prosodic modeling based on Mixture of Experts
Qingliang Meng, Yuqing Deng, Wei Liang, Limei Yu, Huizhi Liang, Tian Li

TL;DR
FNH-TTS is a novel speech synthesis system that combines advanced prosodic modeling with a fast, non-autoregressive architecture, achieving more natural, human-like speech with improved quality and efficiency.
Contribution
The paper introduces a new Duration Predictor based on Mixture of Experts and a multi-scale discriminator-based Vocoder, integrated into the VITS framework for enhanced prosody and synthesis quality.
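To make the Mixture-of-Experts idea concrete, here is a minimal NumPy sketch of a softly gated duration predictor. It is an illustration only, not the FNH-TTS architecture: this summary does not give the number of experts, the gating scheme, or the layer types, so the class name `MoEDurationPredictor`, the linear experts, the softmax gate, and all dimensions are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class MoEDurationPredictor:
    """Hypothetical soft mixture-of-experts over per-phoneme hidden states.

    Each expert is a single linear map to a log-duration (an assumption;
    real experts would likely be deeper networks).
    """

    def __init__(self, hidden_dim, num_experts, seed=0):
        rng = np.random.default_rng(seed)
        # Gate: projects each hidden state to one score per expert.
        self.w_gate = rng.normal(scale=0.1, size=(hidden_dim, num_experts))
        # Experts: one weight vector and one bias per expert.
        self.w_experts = rng.normal(scale=0.1, size=(num_experts, hidden_dim))
        self.b_experts = np.zeros(num_experts)

    def gate(self, h):
        # h: (T, hidden_dim) -> mixing weights (T, num_experts), rows sum to 1.
        return softmax(h @ self.w_gate, axis=-1)

    def log_durations(self, h):
        # Every expert predicts a log-duration for every phoneme: (T, E).
        expert_out = h @ self.w_experts.T + self.b_experts
        # Gate-weighted combination of the expert predictions: (T,).
        return (self.gate(h) * expert_out).sum(axis=-1)

    def durations(self, h):
        # Convert to integer frame counts, at least one frame per phoneme.
        return np.maximum(1, np.ceil(np.exp(self.log_durations(h)))).astype(int)


# Usage: 5 phonemes with 16-dimensional hidden states (random placeholders).
h = np.random.default_rng(1).normal(size=(5, 16))
model = MoEDurationPredictor(hidden_dim=16, num_experts=4)
frames = model.durations(h)
```

In a VITS-style pipeline, the predicted frame counts would be used to expand phoneme-level representations to frame level before decoding. The weights here are random, so the outputs carry no meaning beyond showing the data flow.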
Findings
Outperforms existing systems in synthesis quality and speed
Produces more natural duration predictions aligned with human speech
Achieves superior results in phoneme duration and Vocoder metrics
Abstract
Achieving natural and human-like speech synthesis with low inference costs remains a major challenge in speech synthesis research. This study focuses on human prosodic patterns and synthesized spectrum harmony, addressing the challenges of prosody modeling and artifact issues in non-autoregressive models. To enhance prosody modeling and synthesis quality, we introduce a new Duration Predictor based on the Mixture of Experts alongside a new Vocoder with two advanced multi-scale discriminators. We integrated these new modules into the VITS system, forming our FNH-TTS system. Our experiments on LJSpeech, VCTK, and LibriTTS demonstrate the system's superiority in synthesis quality, phoneme duration prediction, Vocoder results, and synthesis speed. Our prosody visualization results show that FNH-TTS produces duration predictions that more closely align with natural human beings than…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems
