Comparative Evaluation of Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2
Zackary Rackauckas, Julia Hirschberg

TL;DR
This paper compares VITS and Style-BERT-VITS2 JP Extra (SBV2JE) for expressive Japanese character speech synthesis. SBV2JE reaches near-human naturalness, which suggests potential for language learning and character dialogue applications.
Contribution
The paper provides an empirical evaluation of two open-source text-to-speech models on in-domain, character-driven Japanese speech. It highlights the effectiveness of SBV2JE, which is enhanced by pitch-accent controls and a WavLM-based discriminator.
Findings
SBV2JE matches human ground truth in naturalness (MOS 4.37 vs. 4.38)
SBV2JE achieves a lower word error rate (WER)
SBV2JE is slightly preferred in comparative mean opinion score (CMOS) evaluations
Abstract
Synthesizing expressive Japanese character speech poses unique challenges due to pitch-accent sensitivity and stylistic variability. This paper empirically evaluates two open-source text-to-speech models, VITS and Style-BERT-VITS2 JP Extra (SBV2JE), on in-domain, character-driven Japanese speech. Using three character-specific datasets, we evaluate the models on naturalness (mean opinion score, MOS, and comparative mean opinion score, CMOS), intelligibility (word error rate, WER), and speaker consistency. SBV2JE matches human ground truth in naturalness (MOS 4.37 vs. 4.38), achieves lower WER, and is slightly preferred in CMOS. Enhanced by pitch-accent controls and a WavLM-based discriminator, SBV2JE proves effective for applications like language learning and character dialogue generation, despite higher computational demands.
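The intelligibility metric named in the abstract, word error rate, can be sketched as follows. This is a minimal illustration and not the authors' evaluation pipeline: it assumes the reference transcript and the speech recognizer's output have already been split into tokens (Japanese has no whitespace word boundaries, so a tokenizer would be needed upstream, and the choice of tokenizer is an assumption here). The function name and the toy sentences are hypothetical.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference tokens."""
    if not reference:
        raise ValueError("reference must contain at least one token")

    # prev[j] is the edit distance between the reference prefix processed so far
    # and hypothesis[:j]; it starts as the cost of inserting j hypothesis tokens.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_tok in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_tok == hyp_tok else 1)
            deletion = prev[j] + 1       # reference token missing from hypothesis
            insertion = curr[j - 1] + 1  # extra token in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(reference)


# Toy, pre-tokenized example (hypothetical data, not from the paper).
reference = ["今日", "は", "いい", "天気", "です"]
hypothesis = ["今日", "は", "天気", "です"]  # one token dropped
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.2f}")
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted tokens. A lower value means the synthesized speech was transcribed more faithfully.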
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Natural Language Processing Techniques
