VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka
Li-Wei Chen, Hung-Shin Lee, Chen-Chi Chang

TL;DR
VoxHakka is a multi-dialect Taiwanese Hakka TTS system trained on dialect-specific data that was collected by web scraping and cleaned with ASR. It achieves high naturalness and pronunciation accuracy and outperforms existing Hakka TTS systems.
Contribution
This paper presents VoxHakka, the first high-quality, multi-dialect Hakka TTS system trained on a novel dataset created through web scraping and ASR-based data cleaning techniques.
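The ASR-based cleaning step can be illustrated with a minimal sketch. It assumes each scraped utterance comes with a reference transcript and an ASR hypothesis, and keeps only utterances whose character error rate (CER) falls under a threshold. The CER criterion, the `max_cer` value of 0.1, and the function names are illustrative assumptions, not details taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of an ASR hypothesis against a reference."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def filter_utterances(items, max_cer=0.1):
    """Keep (utt_id, transcript) pairs whose ASR output is close to the transcript.

    `items` is an iterable of (utt_id, transcript, asr_hypothesis) tuples.
    The threshold is a hypothetical value chosen for illustration.
    """
    return [(utt_id, ref) for utt_id, ref, hyp in items if cer(ref, hyp) <= max_cer]


if __name__ == "__main__":
    samples = [
        ("a", "hello", "hello"),  # exact match, kept
        ("b", "hello", "help"),   # CER 0.4, dropped
    ]
    print(filter_utterances(samples))
```

A stricter threshold discards more noisy audio-text pairs at the cost of a smaller training set. For an under-resourced language, that trade-off would likely need tuning.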
Findings
VoxHakka outperforms existing Hakka TTS systems in naturalness and pronunciation accuracy.
The system supports six Hakka dialects and generates speaker-aware speech.
The dataset and methods facilitate resource-efficient Hakka speech synthesis.
Abstract
This paper introduces VoxHakka, a text-to-speech (TTS) system designed for Taiwanese Hakka, a critically under-resourced language spoken in Taiwan. Leveraging the YourTTS framework, VoxHakka achieves high naturalness and accuracy and low real-time factor in speech synthesis while supporting six distinct Hakka dialects. This is achieved by training the model with dialect-specific data, allowing for the generation of speaker-aware Hakka speech. To address the scarcity of publicly available Hakka speech corpora, we employed a cost-effective approach utilizing a web scraping pipeline coupled with automatic speech recognition (ASR)-based data cleaning techniques. This process ensured the acquisition of a high-quality, multi-speaker, multi-dialect dataset suitable for TTS training. Subjective listening tests conducted using comparative mean opinion scores (CMOS) demonstrate that VoxHakka…
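The two evaluation quantities named in the abstract can be sketched as follows. Real-time factor (RTF) is conventionally the synthesis time divided by the duration of the generated audio, so values below 1 mean faster-than-real-time synthesis. CMOS is the mean of listeners' comparative ratings, commonly on a seven-point scale from -3 to +3. The rating scale and the function names are assumptions here, since the excerpt does not state the paper's exact protocol.

```python
from statistics import mean


def real_time_factor(synthesis_seconds, audio_seconds):
    """Ratio of compute time to generated audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def cmos(ratings):
    """Mean comparative rating; positive values favor the system under test.

    Assumes each rating is on a -3..+3 scale (a common convention).
    """
    return mean(ratings)


if __name__ == "__main__":
    print(real_time_factor(0.5, 2.0))  # 0.5 s of compute for 2 s of audio
    print(cmos([1, 0, 2, -1]))         # four hypothetical listener ratings
```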
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
