Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS
Onkar Kishor Susladkar, Vishesh Tripathi, Biddwan Ahmed

TL;DR
This paper presents a new, comprehensive Bahasa TTS dataset and a novel TTS model, EnGen-TTS, that outperforms established baselines in speech synthesis quality while remaining efficient for the Bahasa language.
Contribution
Introduces a large, diverse Bahasa TTS dataset and a novel model, EnGen-TTS, that outperforms existing baselines in quality and is efficient to run.
Findings
EnGen-TTS achieves a MOS of 4.45 ± 0.13.
The dataset contains approximately 55 hours of audio across 52K recordings.
EnGen-TTS shows efficient real-time performance, as measured by real-time factor and model size.
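The "±" in a reported MOS is usually a confidence interval over listener ratings. The summary does not say how the 0.13 was computed, so the sketch below assumes the common convention of a 95% confidence interval under a normal approximation (half-width = 1.96 · s / √n). The function name `mos_with_ci` and the ratings are hypothetical and for illustration only; they are not data from the paper.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of 1-5 opinion scores.

    Assumption: the interval is a normal-approximation confidence interval,
    half_width = z * s / sqrt(n), with s the sample standard deviation.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mean = statistics.mean(ratings)
    sd = statistics.stdev(ratings)  # sample standard deviation (divides by n - 1)
    half_width = z * sd / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical ratings, not from the paper.
ratings = [5, 4, 5, 4, 5, 4, 4, 5, 5, 4]
mean, hw = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} ± {hw:.2f}")  # prints "MOS = 4.50 ± 0.33"
```

With only ten ratings the interval is wide. A half-width as tight as ±0.13 implies many more ratings, or less spread among them.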
Abstract
This research introduces a comprehensive Bahasa text-to-speech (TTS) dataset and a novel TTS model, EnGen-TTS, designed to enhance the quality and versatility of synthetic speech in the Bahasa language. The dataset spans ~55.0 hours and 52K audio recordings and draws on diverse textual sources to ensure linguistic richness. A meticulous recording setup with professional equipment captures the nuances of Bahasa phonetics and yields high-fidelity audio samples. Statistical analysis reveals the dataset's scale and diversity, laying the foundation for model training and evaluation. The proposed EnGen-TTS model outperforms established baselines, achieving a Mean Opinion Score (MOS) of 4.45 ± 0.13. Additionally, our investigation of real-time factor and model size highlights EnGen-TTS as a compelling, efficient choice. This research marks a…
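Real-time factor (RTF) is commonly defined as wall-clock synthesis time divided by the duration of the generated audio. Under this definition, RTF below 1 means speech is produced faster than it plays back. The abstract does not spell out the exact definition or timing setup used, so the sketch below assumes this common convention. `synthesize` is a hypothetical stand-in for any TTS inference call that returns raw samples, and `sample_rate` is an assumed parameter.

```python
import time
from typing import Callable, Sequence


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = synthesis time / generated audio duration (assumed definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(
    synthesize: Callable[[str], Sequence[float]], text: str, sample_rate: int
) -> float:
    """Time one call to a (hypothetical) TTS function and return its RTF."""
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(waveform) / sample_rate  # samples -> seconds
    return real_time_factor(elapsed, audio_seconds)


# 0.5 s of compute for 4 s of audio gives RTF = 0.125.
print(real_time_factor(0.5, 4.0))  # prints 0.125
```

In practice RTF is averaged over many sentences and measured after a warm-up call, so one-off startup costs do not dominate the figure.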
Taxonomy
Topics: Educational Technology Systems
