TL;DR
JaiTTS-v1.0 is a Thai voice-cloning TTS model that handles numerals and Thai-English code-switching without explicit text normalization. It reports a state-of-the-art CER of 1.94% on short-duration speech and is preferred over commercial systems in human evaluations.
Contribution
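The paper introduces JaiTTS-v1.0, a Thai voice-cloning TTS model adapted from VoxCPM (a tokenizer-free autoregressive TTS model) through continual training on a large Thai-centric speech corpus. It processes numerals and Thai-English code-switching directly, and the authors make their code and a demo available.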
Findings
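Achieves a CER of 1.94% on short-duration tasks, lower than the human ground-truth CER of 1.98%, and performs on par with human ground truth on long-duration tasks.
Wins 283 of 400 pairwise human-judgment comparisons against commercial flagship TTS systems, with only 58 losses.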
Handles numerals and code-switching without explicit normalization.
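For readers unfamiliar with the metric, character error rate (CER) is usually defined as the character-level edit distance between a reference text and a hypothesis, divided by the reference length. The sketch below shows that standard definition only. It is not the paper's evaluation pipeline: the function names are hypothetical, and it assumes the hypothesis is a transcript of the synthesized audio and that each Unicode code point (including Thai vowel and tone marks) counts as one character.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Toy example: one substituted character in a four-character reference.
    print(cer("abcd", "abxd"))
```

Under this definition, a CER of 1.94% corresponds to roughly two character errors per hundred reference characters.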
Abstract
We present JaiTTS-v1.0, a state-of-the-art Thai voice cloning text-to-speech model built through continual training on a large Thai-centric speech corpus. The model architecture is adapted from VoxCPM, a tokenizer-free autoregressive TTS model. JaiTTS-v1.0 directly processes numerals and Thai-English code-switching, which is very common in realistic settings, without explicit text normalization. We test the models on short- and long-duration speech generation, which reflects many real-world use cases. JaiTTS-v1.0 achieves a state-of-the-art CER of 1.94%, surpassing the human ground truth of 1.98% for short-duration tasks while performing on par with human ground truth for long-duration tasks. In human judgment evaluations, our model wins 283 of 400 pairwise comparisons against commercial flagships, with only 58 losses. Our code and demo are available at…
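The pairwise result in the abstract can be turned into rates with simple arithmetic. The sketch below uses only the three figures quoted above (400 comparisons, 283 wins, 58 losses). The abstract does not say how the remaining comparisons were scored, so treating them as ties or no-preference votes is an assumption, and the function name is hypothetical.

```python
def pairwise_summary(total: int, wins: int, losses: int) -> dict:
    """Summarize a pairwise preference test from raw counts."""
    # Assumption: comparisons that are neither wins nor losses are ties
    # or no-preference votes; the abstract does not state this.
    other = total - wins - losses
    return {
        "other": other,
        "win_rate": wins / total,
        "loss_rate": losses / total,
        # Win rate among comparisons that produced a clear preference.
        "decisive_win_rate": wins / (wins + losses),
    }


if __name__ == "__main__":
    summary = pairwise_summary(total=400, wins=283, losses=58)
    for name, value in summary.items():
        print(name, value)
```

With the quoted counts, the win rate over all comparisons is 283/400 = 70.75% and the loss rate is 58/400 = 14.5%, leaving 59 comparisons unaccounted for in the abstract.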
