BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights

Chan-Jan Hsu; Yi-Cheng Lin; Chia-Chun Lin; Wei-Chih Chen; Ho Lam Chung; Chen-An Li; Yi-Chang Chen; Chien-Yu Yu; Ming-Ji Lee; Chien-Cheng Chen; Ru-Heng Huang; Hung-yi Lee; Da-Shan Shiu

arXiv:2501.17790·cs.CL·January 30, 2025

TL;DR

BreezyVoice is a TTS system tailored for Taiwanese Mandarin that effectively handles polyphone disambiguation and generates high-fidelity speech, demonstrating robustness across various contexts and speaker variations.

Contribution

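The paper introduces BreezyVoice, which builds on CosyVoice and combines an $S^{3}$ tokenizer, a large language model (LLM), an optimal-transport conditional flow matching model (OT-CFM), and a grapheme-to-phoneme prediction model to improve polyphone disambiguation and speech realism in Taiwanese Mandarin TTS.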
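For readers unfamiliar with OT-CFM, the objective below is the standard optimal-transport conditional flow matching loss from the flow-matching literature. It is given here only as background: this page does not confirm the exact form, conditioning inputs, or hyperparameters used in BreezyVoice. The symbols $x_0$ (Gaussian noise), $x_1$ (a target acoustic feature such as a mel-spectrogram frame sequence), $v_\theta$ (the learned vector field), and $\sigma_{\min}$ (a small constant) are notation assumed for this sketch.

$$
x_t = \bigl(1 - (1 - \sigma_{\min})\,t\bigr)\,x_0 + t\,x_1, \qquad t \in [0, 1]
$$

$$
\mathcal{L}_{\text{OT-CFM}} = \mathbb{E}_{t,\; x_0 \sim \mathcal{N}(0, I),\; x_1 \sim q(x_1)} \left\| v_\theta(x_t, t) - \bigl(x_1 - (1 - \sigma_{\min})\,x_0\bigr) \right\|^2
$$

The interpolation moves along a straight line from noise to data, so the regression target $x_1 - (1 - \sigma_{\min})\,x_0$ is constant in $t$. In a TTS setting the vector field would additionally be conditioned on speech tokens and speaker information; that conditioning is omitted here.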

Findings

01 Superior performance in general and code-switching contexts
02 Enhanced robustness for long-tail speakers
03 Valuable insights into neural codec TTS systems

Abstract

We present BreezyVoice, a Text-to-Speech (TTS) system specifically adapted for Taiwanese Mandarin, highlighting phonetic control abilities to address the unique challenges of polyphone disambiguation in the language. Building upon CosyVoice, we incorporate an $S^{3}$ tokenizer, a large language model (LLM), an optimal-transport conditional flow matching model (OT-CFM), and a grapheme-to-phoneme prediction model to generate realistic speech that closely mimics human utterances. Our evaluation demonstrates BreezyVoice's superior performance in both general and code-switching contexts, highlighting its robustness and effectiveness in generating high-fidelity speech. Additionally, we address the challenges of generalizability in modeling long-tail speakers and polyphone disambiguation. Our approach significantly enhances performance and offers valuable insights into the workings of neural…
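Polyphone disambiguation means choosing the correct reading for a character that has several pronunciations depending on context. The sketch below is a toy illustration of the problem only, not BreezyVoice's grapheme-to-phoneme prediction model: it assumes a tiny hand-written lexicon and a longest-match lookup, and the names `DEFAULT_READING`, `WORD_READINGS`, and `to_phonemes` are hypothetical. Readings are written as pinyin with tone numbers purely for readability.

```python
# Toy polyphone disambiguation by longest-match word lookup.
# Illustrative only: the lexicon and function names are assumptions,
# not taken from the BreezyVoice paper or code.

# Fallback reading used when a character is not covered by a known word.
DEFAULT_READING = {
    "銀": "yin2",
    "行": "xing2",   # polyphone: xing2 (to walk) vs. hang2 (as in "bank")
    "走": "zou3",
    "成": "cheng2",
    "長": "chang2",  # polyphone: chang2 (long) vs. zhang3 (to grow)
}

# Word-level entries override the per-character defaults.
WORD_READINGS = {
    "銀行": ["yin2", "hang2"],
    "成長": ["cheng2", "zhang3"],
}


def to_phonemes(text, max_word_len=4):
    """Return one reading per character, preferring the longest known word."""
    readings = []
    i = 0
    while i < len(text):
        # Try multi-character words first, longest candidate first.
        for n in range(min(max_word_len, len(text) - i), 1, -1):
            word = text[i:i + n]
            if word in WORD_READINGS:
                readings.extend(WORD_READINGS[word])
                i += n
                break
        else:
            # No word matched: fall back to the default single-character
            # reading, or pass the character through if it is unknown.
            readings.append(DEFAULT_READING.get(text[i], text[i]))
            i += 1
    return readings


if __name__ == "__main__":
    print(to_phonemes("銀行"))  # 行 read as hang2 inside the word for "bank"
    print(to_phonemes("行走"))  # 行 falls back to its default reading xing2
```

A lookup table like this breaks down when the correct reading depends on wider sentence context rather than on a fixed word, which is the kind of case a learned prediction model is meant to handle.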

Code & Models

Models

🤗 MediaTek-Research/BreezyVoice · model · ♡ 52

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing