End to End Bangla Speech Synthesis
Prithwiraj Bhattacharjee, Rajan Saha Raju, Arif Ahmad, and M. Shahidur Rahman

TL;DR
This paper presents a deep learning-based end-to-end Bangla speech synthesis system that needs no frontend preprocessor or Grapheme-to-Phoneme (G2P) converter and is built with minimal human annotation, reaching a Mean Opinion Score (MOS) of 3.79 on a scale of 5.0.
Contribution
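The authors introduce an end-to-end Bangla TTS model with only three major components (an encoder, a decoder, and a post-processing net that includes waveform synthesis), trained on 20 hours of phonetically balanced single-speaker speech and requiring no frontend preprocessor or G2P converter. It is reported to outperform existing non-commercial Bangla TTS systems in naturalness.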
Findings
Achieved a Mean Opinion Score (MOS) of 3.79 on a scale of 5.0 in subjective evaluation.
Obtained a Perceptual Evaluation of Speech Quality (PESQ) score of 0.77 as an objective measure.
Outperforms all existing non-commercial Bangla TTS systems.
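For readers unfamiliar with the metric, a Mean Opinion Score is the arithmetic mean of listener ratings. The sketch below is illustrative only: the ratings are made-up placeholders, not data from the paper, and the function name `mean_opinion_score`, the 1-to-5 rating range, and the normal-approximation confidence interval are assumptions rather than the authors' evaluation protocol.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings, scale_min=1.0, scale_max=5.0):
    """Return (MOS, 95% CI half-width) for a list of listener ratings.

    Assumes an absolute category rating scale from scale_min to scale_max
    and a normal approximation for the confidence interval.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not scale_min <= r <= scale_max:
            raise ValueError(f"rating {r} is outside the scale")
    mos = mean(ratings)
    # The sample standard deviation needs at least two ratings.
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = 0.0
    return mos, half_width


# Hypothetical ratings from eight listeners for one synthesized utterance.
example_ratings = [4, 4, 3, 5, 4, 3, 4, 4]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean makes it easier to judge whether a difference between two systems is larger than the listener-to-listener spread.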
Abstract
A Text-to-Speech (TTS) system synthesizes speech from a given text following a particular approach. The main approaches for implementing a TTS system are concatenative synthesis, Hidden Markov Model (HMM) based synthesis, and Deep Learning (DL) based synthesis with multiple building blocks. Here, we present our deep learning-based end-to-end Bangla speech synthesis system. It has been implemented with minimal human annotation using only 3 major components (Encoder, Decoder, and Post-processing net including waveform synthesis). It does not require any frontend preprocessor or Grapheme-to-Phoneme (G2P) converter. Our model has been trained on 20 hours of phonetically balanced single-speaker speech data. It has obtained a 3.79 Mean Opinion Score (MOS) on a scale of 5.0 as subjective evaluation and a 0.77 Perceptual Evaluation of Speech Quality (PESQ) score on a…
