ToneUnit: A Speech Discretization Approach for Tonal Language Speech Synthesis

Dehua Tao; Daxin Tan; Yu Ting Yeung; Xiao Chen; Tan Lee

arXiv:2406.08989 · eess.AS · September 4, 2024

Open Access

TL;DR

ToneUnit is a speech discretization framework for tonal languages such as Mandarin Chinese. It addresses the "tone shift" issue in synthesized speech and performs well even with limited annotated data.

Contribution

The paper proposes ToneUnit, a tone-aware speech discretization method using CTC supervision, improving tonal speech synthesis quality.
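As a rough illustration of what CTC supervision with tone labels implies, the sketch below applies the standard CTC collapsing rule (merge repeated labels, then drop blanks) to frame-level predictions over tone-marked pinyin syllables. The label format (`ma1`, `ma3`), the `<blank>` symbol, and the function name `ctc_collapse` are assumptions made for illustration; this page does not specify the paper's actual label inventory or model.

```python
BLANK = "<blank>"  # assumed name for the CTC blank symbol


def ctc_collapse(frame_labels, blank=BLANK):
    """Collapse frame-level labels with the standard CTC rule.

    Consecutive repeats are merged first, then blanks are dropped.
    A blank between two identical labels keeps them as two tokens.
    """
    collapsed = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            collapsed.append(label)
        previous = label
    return collapsed


# Hypothetical frame-level output over tone-marked syllables.
frames = ["ma1", "ma1", BLANK, "ma1", "ma3", "ma3", BLANK]
print(ctc_collapse(frames))
```

Because `ma1` and `ma3` are distinct labels in this sketch, a target sequence distinguishes syllables that differ only in tone, which is one way to read the intuition behind tone-aware supervision.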

Findings

01 ToneUnit resolves the tone shift issue in Mandarin speech synthesis.

02 Finite scalar quantization enhances ToneUnit's effectiveness.

03 ToneUnit remains effective with minimal annotated data.
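For readers unfamiliar with finite scalar quantization, the following is a minimal sketch of the general technique: each latent dimension is bounded, rounded to one of a small number of integer levels, and the per-dimension codes are combined into a single unit index. The level counts `[5, 5, 5]`, the `tanh` bounding, the restriction to odd level counts, and the function names are assumptions for illustration, not ToneUnit's reported configuration.

```python
import math


def fsq_quantize(z, levels):
    """Quantize one latent vector, dimension by dimension.

    Assumes every entry of `levels` is odd, so a dimension with L levels
    rounds to integers in [-(L - 1) / 2, (L - 1) / 2].
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    codes = []
    for value, num_levels in zip(z, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = (num_levels - 1) // 2
        bounded = half * math.tanh(value)  # squash into (-half, half)
        codes.append(round(bounded))       # snap to the nearest integer level
    return codes


def fsq_index(codes, levels):
    """Combine per-dimension codes into one unit index (mixed radix)."""
    index = 0
    for code, num_levels in zip(codes, levels):
        half = (num_levels - 1) // 2
        index = index * num_levels + (code + half)
    return index


levels = [5, 5, 5]  # implicit codebook of 5 * 5 * 5 = 125 units
codes = fsq_quantize([0.0, 10.0, -10.0], levels)
print(codes, fsq_index(codes, levels))
```

The codebook here is implicit: no embedding table is learned, and the number of discrete units is simply the product of the level counts.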

Abstract

Representing speech as discretized units has numerous benefits in supporting downstream spoken language processing tasks. However, the approach has been less explored in speech synthesis of tonal languages like Mandarin Chinese. Our preliminary experiments on Chinese speech synthesis reveal the issue of "tone shift", where a synthesized speech utterance contains correct base syllables but incorrect tones. To address the issue, we propose the ToneUnit framework, which leverages annotated data with tone labels as CTC supervision to learn tone-aware discrete speech units for Mandarin Chinese speech. Our findings indicate that the discrete units acquired through ToneUnit resolve the "tone shift" issue in synthesized Chinese speech and yield favorable results in English synthesis. Moreover, the experimental results suggest that finite scalar quantization enhances the effectiveness of…
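The abstract's definition of "tone shift" (correct base syllables, incorrect tones) can be made concrete with a small check over tone-marked syllable sequences. This is only a sketch: the trailing-digit tone notation (`hao3`) and the helper names `split_tone` and `is_tone_shift` are hypothetical and are not taken from the paper's evaluation procedure.

```python
def split_tone(label):
    """Split 'ma3' into ('ma', '3'); a label with no trailing digit gets tone ''."""
    if label and label[-1].isdigit():
        return label[:-1], label[-1]
    return label, ""


def is_tone_shift(reference, synthesized):
    """Return True when base syllables all match but at least one tone differs."""
    if len(reference) != len(synthesized):
        return False
    ref = [split_tone(label) for label in reference]
    syn = [split_tone(label) for label in synthesized]
    same_base = all(r[0] == s[0] for r, s in zip(ref, syn))
    tone_differs = any(r[1] != s[1] for r, s in zip(ref, syn))
    return same_base and tone_differs


print(is_tone_shift(["ni3", "hao3"], ["ni3", "hao4"]))  # tone of second syllable changed
print(is_tone_shift(["ni3", "hao3"], ["ni3", "gao3"]))  # base syllable changed instead
```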

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Balanced Selection