Fast and small footprint Hybrid HMM-HiFiGAN based system for speech synthesis in Indian languages
Sudhanshu Srivastava, Ishika Gupta, Anusha Prakash, Jom Kuriakose, Hema A. Murthy

TL;DR
This paper presents a hybrid HMM-HiFi-GAN speech synthesis system for Indian languages that combines traditional HMM-based feature generation with neural vocoding to achieve high-quality, natural-sounding speech with low computational requirements.
Contribution
It introduces a hybrid approach in which an HMM-based system, trained on high-resolution mel-spectrograms, generates features for a HiFi-GAN vocoder, improving speech synthesis quality in low-resource Indian languages.
Findings
Achieved naturalness comparable to end-to-end systems, as measured by DMOS and PC tests.
Demonstrated the effectiveness of high-resolution mel-spectrograms in HMM-based synthesis.
Provided a computationally efficient system suitable for low-resource scenarios.
Abstract
Hidden-Markov-model (HMM) based text-to-speech (HTS) offers flexibility in speaking styles along with fast training and synthesis, while being computationally less intensive. HTS performs well even in low-resource scenarios. Its primary drawback is that the voice quality is poor compared to that of end-to-end (E2E) systems. A hybrid approach is proposed that combines HMM-based feature generation with a neural-network-based HiFi-GAN vocoder to improve HTS synthesis quality. HTS is trained on high-resolution mel-spectrograms instead of conventional mel generalized coefficients (MGC), and the output mel-spectrogram corresponding to the input text is passed to a HiFi-GAN vocoder trained on Indic languages. The resulting naturalness is equivalent to that of E2E systems, as evidenced by the DMOS and PC tests.
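The mel-spectrogram features mentioned in the abstract can be illustrated with a minimal sketch of a triangular mel filterbank applied to one magnitude-spectrum frame. This is not the authors' feature extraction: the function names, the sample rate (22050 Hz), FFT size (1024), number of mel bands (80), and the log floor are all illustrative assumptions, and the paper's actual "high-resolution" settings are not stated in this summary.

```python
import math


def hz_to_mel(f_hz):
    # Common HTK-style mel formula (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(sample_rate, n_fft, n_mels):
    """Return n_mels triangular filters, each a list over n_fft // 2 + 1 FFT bins."""
    n_bins = n_fft // 2 + 1
    top_mel = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies, evenly spaced on the mel scale.
    edges = [mel_to_hz(top_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        # Triangle rising from lo to mid, falling from mid to hi, zero elsewhere.
        row = [
            max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
            for f in bin_hz
        ]
        bank.append(row)
    return bank


def log_mel_frame(magnitudes, bank, floor=1e-5):
    """Project one magnitude-spectrum frame onto the mel filters and take the log."""
    return [
        math.log(max(floor, sum(w * x for w, x in zip(row, magnitudes))))
        for row in bank
    ]


# Hypothetical settings: 22.05 kHz audio, 1024-point FFT, 80 mel bands.
bank = mel_filterbank(sample_rate=22050, n_fft=1024, n_mels=80)
frame = log_mel_frame([1.0] * 513, bank)  # flat spectrum as a stand-in input
```

Under these assumptions, raising `n_mels` gives a finer frequency resolution per frame. In the pipeline the abstract describes, frames of this kind would be generated by the HMM system and passed to the vocoder.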
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques
Methods: HiFi-GAN
