Comparative Evaluation of Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2

Zackary Rackauckas; Julia Hirschberg

arXiv:2505.17320·cs.CL·December 2, 2025

TL;DR

This paper compares VITS and Style-BERT-VITS2 JP Extra for expressive Japanese speech synthesis, showing SBV2JE's near-human naturalness and its potential for language learning and dialogue applications.

Contribution

The paper provides an empirical evaluation of two open-source models on character-driven Japanese speech, highlighting the effectiveness of SBV2JE, which is enhanced by pitch-accent controls and a WavLM-based discriminator.

Findings

01 SBV2JE matches human ground truth in naturalness (MOS 4.37 vs. 4.38)

02 SBV2JE achieves lower word error rate

03 SBV2JE is slightly preferred in comparative (CMOS) evaluations

Abstract

Synthesizing expressive Japanese character speech poses unique challenges due to pitch-accent sensitivity and stylistic variability. This paper empirically evaluates two open-source text-to-speech models, VITS and Style-BERT-VITS2 JP Extra (SBV2JE), on in-domain, character-driven Japanese speech. Using three character-specific datasets, we evaluate the models on naturalness (mean opinion score, MOS, and comparative mean opinion score, CMOS), intelligibility (word error rate, WER), and speaker consistency. SBV2JE matches human ground truth in naturalness (MOS 4.37 vs. 4.38), achieves lower WER, and is slightly preferred in CMOS. Enhanced by pitch-accent controls and a WavLM-based discriminator, SBV2JE proves effective for applications like language learning and character dialogue generation, despite higher computational demands.
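
As a rough illustration of the intelligibility metric named in the abstract, word error rate is conventionally the token-level edit distance between a reference transcript and a recognizer's transcript of the synthesized audio, divided by the reference length. The sketch below is not the authors' pipeline: the function names are hypothetical, and it assumes the Japanese text has already been segmented into tokens, since the abstract does not say which tokenizer or speech recognizer was used.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (hypothetical helper)."""
    # prev[j] holds the distance between the first i-1 ref tokens and first j hyp tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference tokens to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(ref_tokens: Sequence[str], hyp_tokens: Sequence[str]) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# Assumed pre-segmented example; real Japanese text needs a tokenizer first.
reference = ["今日", "は", "晴れ", "です"]
hypothesis = ["今日", "は", "雨", "です"]
wer = word_error_rate(reference, hypothesis)
```

Here one of four reference tokens is substituted, so `wer` is 0.25. Because Japanese has no whitespace word boundaries, the score depends on the segmentation chosen, which is why the tokenization step is left as an explicit assumption.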

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Natural Language Processing Techniques