Fast and small footprint Hybrid HMM-HiFiGAN based system for speech synthesis in Indian languages

Sudhanshu Srivastava; Ishika Gupta; Anusha Prakash; Jom Kuriakose; Hema A. Murthy

arXiv:2302.06227 · eess.AS · February 14, 2023

TL;DR

This paper presents a hybrid HMM-HiFi-GAN speech synthesis system for Indian languages that combines traditional HMM-based feature generation with neural vocoding to achieve high-quality, natural-sounding speech with low computational requirements.

Contribution

The paper introduces a hybrid approach in which an HMM-based system, trained on high-resolution mel-spectrograms, generates the features for a HiFi-GAN vocoder, improving synthesis quality for Indian languages in low-resource settings.

Findings

01 In DMOS and PC tests, achieved naturalness comparable to that of end-to-end systems.

02 Demonstrated the effectiveness of high-resolution mel-spectrograms in HMM-based synthesis.

03 Provided a computationally efficient system suitable for low-resource scenarios.

Abstract

Hidden-Markov-model (HMM) based text-to-speech (HTS) offers flexibility in speaking styles along with fast training and synthesis, while being computationally less intensive. HTS performs well even in low-resource scenarios. Its primary drawback is that the voice quality is poor compared to that of end-to-end (E2E) systems. A hybrid approach that combines HMM-based feature generation with a neural-network-based HiFi-GAN vocoder is proposed to improve HTS synthesis quality. HTS is trained on high-resolution mel-spectrograms instead of conventional mel-generalized cepstral coefficients (MGC). The output mel-spectrogram corresponding to the input text is passed to a HiFi-GAN vocoder trained on Indic languages, producing naturalness equivalent to that of E2E systems, as evidenced by degradation mean opinion score (DMOS) and pairwise comparison (PC) tests.
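
To make the feature type in the abstract concrete, the following is a minimal sketch of how one log-mel-spectrogram frame can be computed from a waveform frame. It is an illustration only, not the authors' pipeline: the sample rate, FFT size, number of mel bands, the HTK-style mel formula, and all function names are assumptions chosen for the demo, and the paper's actual "high-resolution" settings are not stated on this page.

```python
import cmath
import math


def hz_to_mel(hz):
    # HTK-style mel scale (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Return n_mels triangular filters over the n_fft // 2 + 1 FFT bins."""
    n_bins = n_fft // 2 + 1
    # n_mels + 2 edge points, equally spaced on the mel scale up to Nyquist.
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def power_spectrum(frame):
    """Naive O(N^2) DFT of a Hann-windowed frame; fine for a demo only."""
    n = len(frame)
    windowed = [
        x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
        for i, x in enumerate(frame)
    ]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        spectrum.append(abs(acc) ** 2)
    return spectrum


def log_mel_frame(frame, bank, floor=1e-10):
    """Project the power spectrum onto the mel filters and take the log."""
    spec = power_spectrum(frame)
    return [
        math.log(max(floor, sum(w * p for w, p in zip(row, spec))))
        for row in bank
    ]


# Illustrative settings, not the paper's configuration.
SAMPLE_RATE = 16000
N_FFT = 256
N_MELS = 20

BANK = mel_filterbank(N_MELS, N_FFT, SAMPLE_RATE)
# A 1 kHz test tone standing in for one frame of speech.
TONE = [math.sin(2.0 * math.pi * 1000.0 * i / SAMPLE_RATE) for i in range(N_FFT)]
FEATURES = log_mel_frame(TONE, BANK)
```

Here `FEATURES` holds one value per mel band for a single frame. In a hybrid system of the kind described, a sequence of such frames would be what the HMM stage predicts from text and what the vocoder converts back into a waveform; raising the number of mel bands gives a finer frequency resolution at the cost of a larger feature vector.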

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques

Methods: HiFi-GAN