Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS

Yookyung Shin; Younggun Lee; Suhee Jo; Yeongtae Hwang; Taesu Kim

arXiv:2207.06000·cs.CL·July 14, 2022·1 cites

TL;DR

This paper introduces a text-based interface for controlling emotional style and transferring styles across speakers in neural TTS, enabling expressive speech synthesis without requiring recordings of the target speaker in the target style.

Contribution

It proposes a bi-modal style encoder that uses a pretrained language model to relate text description embeddings to speech style embeddings, and a novel style loss to improve cross-speaker style transfer on disjoint, multi-style datasets.

Findings

01

High-quality expressive speech generated in unseen styles

02

Effective cross-speaker style transfer on disjoint datasets

03

Style control achieved through text descriptions, without reference recordings in the target style

Abstract

Expressive text-to-speech has shown improved performance in recent years. However, style control of synthetic speech is often restricted to discrete emotion categories and requires training data recorded by the target speaker in the target style. In many practical situations, users may not have reference speech recorded in the target emotion but may still want to control speech style simply by typing a text description of the desired emotional style. In this paper, we propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS. We propose a bi-modal style encoder that models the semantic relationship between the text description embedding and the speech style embedding with a pretrained language model. To further improve cross-speaker style transfer on disjoint, multi-style datasets, we propose a novel style loss. The experimental…
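
As a rough illustration of what relating a text description embedding to a speech style embedding can look like, the sketch below picks the stored style whose embedding is closest to a description embedding by cosine similarity. This is not the paper's encoder or its style loss, neither of which is specified in the abstract. The three-dimensional vectors, the `STYLE_BANK` dictionary, and the `closest_style` helper are all hypothetical stand-ins; in a real system the description vector would come from a pretrained language model and the style vectors from a trained speech style encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical speech style embeddings (toy values, not from the paper).
STYLE_BANK = {
    "happy": [0.9, 0.1, 0.0],
    "sad": [0.0, 0.2, 0.9],
    "angry": [0.1, 0.9, 0.1],
}


def closest_style(text_embedding, style_bank):
    """Return the style name whose embedding is most similar to the text embedding."""
    return max(style_bank, key=lambda name: cosine(text_embedding, style_bank[name]))


# Hypothetical embedding of a typed description such as "cheerful and bright".
query = [0.8, 0.2, 0.1]
print(closest_style(query, STYLE_BANK))
```

Nearest-neighbor lookup is only the simplest use of a shared embedding space. A generative TTS model would more plausibly condition on the embedding directly, which is what would allow descriptions of styles that were not seen in training.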

Taxonomy

Topics: Speech Recognition and Synthesis