Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control
Ryuichi Yamamoto, Yuma Shirahata, Masaya Kawamura, Kentaro Tachibana

TL;DR
This paper introduces a description-based controllable TTS system with cross-lingual voice control. A TTS model and a description control model share disentangled, SSL-based timbre and style representations, so voice characteristics can be controlled in a target language that has no audio-description paired data.
Contribution
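It combines a TTS model trained on the target language with a description control model trained on another language. The two models share SSL-based timbre and style representations, which enables cross-lingual voice control without audio-description paired data in the target language.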
Findings
High naturalness in English and Japanese TTS
Effective disentangled control of voice style and timbre
Cross-lingual voice manipulation without paired data
Abstract
We propose a novel description-based controllable text-to-speech (TTS) method with cross-lingual control capability. To address the lack of audio-description paired data in the target language, we combine a TTS model trained on the target language with a description control model trained on another language, which maps input text descriptions to the conditional features of the TTS model. These two models share disentangled timbre and style representations based on self-supervised learning (SSL), allowing for disentangled voice control, such as controlling speaking styles while retaining the original timbre. Furthermore, because the SSL-based timbre and style representations are language-agnostic, combining the TTS and description control models while sharing the same embedding space effectively enables cross-lingual control of voice characteristics. Experiments on English and Japanese…
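The composition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class names, the `restyle_keep_timbre` helper, the 4-dimensional embeddings, and the hash-based toy encoder are all hypothetical stand-ins for learned models. It shows only the data flow, in which a description control model predicts timbre and style embeddings and a separately trained target-language TTS model consumes embeddings from that same space.

```python
import hashlib
from dataclasses import dataclass
from typing import List

EMBED_DIM = 4  # hypothetical size; real SSL-based embeddings are much larger


@dataclass
class VoiceCondition:
    """Disentangled conditioning features shared by both models."""
    timbre: List[float]
    style: List[float]


def _toy_embedding(text: str, dim: int = EMBED_DIM) -> List[float]:
    # Deterministic stand-in for a learned encoder: hash bytes scaled to [0, 1).
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 256.0 for i in range(dim)]


class DescriptionControlModel:
    """Stand-in for the model trained on the language that has
    audio-description pairs; maps a description to timbre and style."""

    def predict(self, description: str) -> VoiceCondition:
        return VoiceCondition(
            timbre=_toy_embedding("timbre:" + description),
            style=_toy_embedding("style:" + description),
        )


class TargetLanguageTTS:
    """Stand-in for the TTS model trained on the target language and
    conditioned on the same embedding space."""

    def synthesize(self, text: str, cond: VoiceCondition) -> dict:
        # A real model would return a waveform; this only records its inputs.
        return {"text": text, "timbre": cond.timbre, "style": cond.style}


def restyle_keep_timbre(original: VoiceCondition,
                        described: VoiceCondition) -> VoiceCondition:
    # Disentangled control: take the style from the description,
    # keep the original speaker's timbre.
    return VoiceCondition(timbre=original.timbre, style=described.style)


control = DescriptionControlModel()
tts = TargetLanguageTTS()

# Hypothetical embeddings for an existing target-language speaker.
speaker = VoiceCondition(timbre=[0.1, 0.2, 0.3, 0.4], style=[0.0] * EMBED_DIM)

# The description is written in the control model's language (English here),
# while the synthesized text is in the target language (Japanese here).
described = control.predict("a calm, low-pitched voice speaking slowly")
cond = restyle_keep_timbre(speaker, described)
out = tts.synthesize("こんにちは", cond)
```

Because the two stand-in models exchange only a `VoiceCondition`, either one could be retrained or swapped without touching the other, which mirrors the reason the abstract gives for sharing one language-agnostic embedding space.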
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
