Fine-grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement

Daxin Tan; Tan Lee

arXiv:2011.03943·eess.AS·October 11, 2021

TL;DR

This paper introduces a neural network system for fine-grained style modeling, transfer, and prediction in expressive TTS. It disentangles content and style at the phone level to reduce content leakage and improve content preservation in style transfer.

Contribution

It proposes a novel phone-level style embedding extraction method combined with collaborative and adversarial learning for better content-style disentanglement in TTS.

Findings

01

Outperforms existing fine-grained style transfer models in content preservation

02

Effective disentanglement of content and style factors in speech

03

Enables both style transfer and TTS synthesis within a unified framework

Abstract

This paper presents a novel neural network system design for fine-grained style modeling, transfer and prediction in expressive text-to-speech (TTS) synthesis. Fine-grained modeling is realized by extracting style embeddings from the mel-spectrograms of phone-level speech segments. Collaborative learning and adversarial learning strategies are applied to achieve effective disentanglement of the content and style factors in speech and to alleviate the "content leakage" problem in style modeling. The proposed system can be used for varying-content speech style transfer in the single-speaker scenario. The results of objective and subjective evaluations show that our system performs better than other fine-grained speech style transfer models, especially in terms of content preservation. By incorporating a style predictor, the proposed system can also be used for text-to-speech…
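To make the idea of phone-level segments concrete, here is a minimal sketch of how a mel-spectrogram might be cut at phone boundaries and reduced to one vector per phone. It is not the authors' method: the abstract describes learned style embeddings, whereas this sketch substitutes plain mean pooling as a stand-in. The function name `phone_level_pool`, the `(start_frame, end_frame)` span format, and the toy data are all assumptions made for illustration.

```python
from statistics import fmean
from typing import List, Sequence, Tuple

Frame = Sequence[float]


def phone_level_pool(
    mel: Sequence[Frame],
    phone_spans: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Return one pooled vector per phone segment.

    mel: frames of a mel-spectrogram, each a sequence of n_mels values.
    phone_spans: hypothetical (start_frame, end_frame) pairs, end exclusive,
        for example as produced by a forced aligner (an assumption here).
    """
    pooled = []
    for start, end in phone_spans:
        if not 0 <= start < end <= len(mel):
            raise ValueError(f"invalid phone span: ({start}, {end})")
        segment = mel[start:end]
        # Average each mel bin over the frames of this phone segment.
        # Mean pooling is only a placeholder for a learned style encoder.
        pooled.append([fmean(bin_values) for bin_values in zip(*segment)])
    return pooled


# Toy input: 5 frames with 2 mel bins each, split into two phones.
toy_mel = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
toy_spans = [(0, 2), (2, 5)]
phone_vectors = phone_level_pool(toy_mel, toy_spans)
print(phone_vectors)  # [[2.0, 3.0], [7.0, 8.0]]
```

The output has one vector per phone rather than one per utterance, which is what makes the modeling fine-grained. In a system like the one described, each segment would presumably pass through a trained encoder instead of an average.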
