U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning

Tao Li; Zhichao Wang; Xinfa Zhu; Jian Cong; Qiao Tian; Yuping Wang; Lei Xie

arXiv:2310.04004 · cs.SD · October 9, 2023

Open Access

TL;DR

U-Style introduces a multi-level, disentangled zero-shot voice cloning framework that significantly improves naturalness, speaker similarity, and style transfer flexibility for unseen speakers and styles.

Contribution

The paper proposes U-Style, a novel cascading U-net architecture with multi-level modeling and normalization techniques for improved zero-shot speaker and style cloning.

Findings

01 Outperforms state-of-the-art in naturalness and speaker similarity

02 Enables style transfer between unseen speakers

03 Achieves better disentanglement of speaker and style representations

Abstract

Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system building, given only a single speech reference of the speaker at hand. Although more practical in real applications, the current zero-shot methods still produce speech with undesirable naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zero-shot setup has not been considered. This is because the unique challenge of zero-shot speaker and style cloning is to learn the disentangled speaker and style representations from only short references representing an arbitrary speaker and an arbitrary style. To address this challenge, we propose U-Style, which employs Grad-TTS as the backbone, particularly cascading a speaker-specific encoder and a style-specific encoder between the text encoder and the diffusion decoder. Thus, leveraging…
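The ordering described in the abstract (text encoder, then a speaker-specific encoder, then a style-specific encoder, then the decoder) can be illustrated with a toy data-flow sketch. This is not the authors' implementation: every function name below is hypothetical, the "encoders" are placeholders that add a mean-pooled reference embedding to each frame, and the decoder is an identity stand-in for the Grad-TTS diffusion decoder. It only shows that the speaker and style references enter at separate stages, so they may come from different utterances.

```python
from typing import List

Vector = List[float]
Sequence = List[Vector]


def mean_pool(frames: Sequence) -> Vector:
    """Collapse a reference utterance (frames x dims) into one embedding."""
    dims = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dims)]


def condition(hidden: Sequence, embedding: Vector) -> Sequence:
    """Assumed conditioning scheme: add the embedding to every frame."""
    return [[h + e for h, e in zip(frame, embedding)] for frame in hidden]


def text_encoder(phoneme_ids: List[int], dims: int = 2) -> Sequence:
    """Placeholder text encoder: one constant-valued vector per phoneme."""
    return [[float(p)] * dims for p in phoneme_ids]


def speaker_encoder(hidden: Sequence, speaker_ref: Sequence) -> Sequence:
    """Placeholder for the speaker-specific stage (hypothetical)."""
    return condition(hidden, mean_pool(speaker_ref))


def style_encoder(hidden: Sequence, style_ref: Sequence) -> Sequence:
    """Placeholder for the style-specific stage (hypothetical)."""
    return condition(hidden, mean_pool(style_ref))


def decoder(hidden: Sequence) -> Sequence:
    """Identity stand-in for the diffusion decoder."""
    return hidden


def synthesize(phoneme_ids: List[int],
               speaker_ref: Sequence,
               style_ref: Sequence) -> Sequence:
    """Cascade: text -> speaker stage -> style stage -> decoder."""
    hidden = text_encoder(phoneme_ids)
    hidden = speaker_encoder(hidden, speaker_ref)
    hidden = style_encoder(hidden, style_ref)
    return decoder(hidden)
```

Calling `synthesize(phonemes, speaker_ref, style_ref)` with references taken from two different recordings mirrors the cross-speaker style transfer setting listed in the findings; in this toy version the two stages differ only in which reference they consume.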

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing