Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control

Ryuichi Yamamoto; Yuma Shirahata; Masaya Kawamura; Kentaro Tachibana

arXiv:2409.17452 · eess.AS · September 27, 2024

TL;DR

This paper introduces a description-based controllable TTS system with cross-lingual voice control. It uses shared, disentangled timbre and style representations to enable natural and controllable speech synthesis across languages without audio-description paired data in the target language.

Contribution

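The paper combines a TTS model trained on the target language with a description control model trained on another language. The two models share SSL-based timbre and style representations, which enables cross-lingual voice control without audio-description paired data in the target language.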

Findings

01 High naturalness in English and Japanese TTS

02 Effective disentangled control of voice style and timbre

03 Cross-lingual voice manipulation without paired data

Abstract

We propose a novel description-based controllable text-to-speech (TTS) method with cross-lingual control capability. To address the lack of audio-description paired data in the target language, we combine a TTS model trained on the target language with a description control model trained on another language, which maps input text descriptions to the conditional features of the TTS model. These two models share disentangled timbre and style representations based on self-supervised learning (SSL), allowing for disentangled voice control, such as controlling speaking styles while retaining the original timbre. Furthermore, because the SSL-based timbre and style representations are language-agnostic, combining the TTS and description control models while sharing the same embedding space effectively enables cross-lingual control of voice characteristics. Experiments on English and Japanese…
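The data flow described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: every name here (`VoiceCondition`, `DescriptionControlModel`, `TargetLanguageTTS`, `control_voice`, `toy_embed`, `EMB_DIM`) is hypothetical, and the hash-based vectors are placeholders for learned SSL-based embeddings. It only shows how two separately trained components might communicate through a shared timbre and style space, and how style could be changed while a reference timbre is retained.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

EMB_DIM = 4  # toy embedding size; an assumption for illustration only
Vector = Tuple[float, ...]


@dataclass(frozen=True)
class VoiceCondition:
    """Shared conditioning space: separate timbre and style vectors."""
    timbre: Vector
    style: Vector


def toy_embed(text: str, salt: str) -> Vector:
    """Deterministic placeholder for a learned embedding (not a real model)."""
    digest = hashlib.sha256(f"{salt}:{text}".encode("utf-8")).digest()
    return tuple(byte / 255.0 for byte in digest[:EMB_DIM])


class DescriptionControlModel:
    """Stand-in for a model trained where audio-description pairs exist."""

    def predict(self, description: str) -> VoiceCondition:
        # Maps a text description to the TTS model's conditional features.
        return VoiceCondition(
            timbre=toy_embed(description, "timbre"),
            style=toy_embed(description, "style"),
        )


class TargetLanguageTTS:
    """Stand-in for a TTS model trained on the target language."""

    def synthesize(self, text: str, cond: VoiceCondition) -> Dict[str, object]:
        # A real model would return audio; here we only record the inputs.
        return {"text": text, "timbre": cond.timbre, "style": cond.style}


def control_voice(
    tts: TargetLanguageTTS,
    controller: DescriptionControlModel,
    text: str,
    description: str,
    reference: Optional[VoiceCondition] = None,
) -> Dict[str, object]:
    """Predict conditions from a description, optionally keeping a timbre."""
    predicted = controller.predict(description)
    if reference is not None:
        # Disentangled control: take the style from the description,
        # keep the timbre of the reference voice unchanged.
        predicted = VoiceCondition(timbre=reference.timbre, style=predicted.style)
    return tts.synthesize(text, predicted)
```

For example, calling `control_voice(tts, controller, "こんにちは", "A calm, low-pitched voice", reference=ref)` would pass the reference timbre through untouched and replace only the style vector. In this sketch the description is in one language and the synthesized text in another, which works only because both components are assumed to read and write the same embedding space.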

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems