Cloning one's voice using very limited data in the wild

Dongyang Dai; Yuanzhe Chen; Li Chen; Ming Tu; Lu Liu; Rui Xia; Qiao Tian; Yuping Wang; Yuxuan Wang

arXiv:2110.03347·eess.AS·October 11, 2021

TL;DR

This paper introduces Hieratron, a voice-cloning framework that works with very limited in-the-wild data and controls timbre independently of style and prosody, improving speech quality and flexibility.

Contribution

The paper presents the Hieratron framework, which models prosody and timbre in two separate modules, enabling voice cloning from very limited data while keeping style and prosody controllable.

Findings

01

Hieratron outperforms traditional methods on limited data

02

Mean opinion score (MOS) for speech quality improves by more than 0.2 points

03

Achieves control of timbre independently of style and prosody

Abstract

With the increasing popularity of speech synthesis products, the industry has raised two requirements for personalized speech synthesis: (1) cloning a person's voice from low-resource, easily accessible data, and (2) cloning a person's voice while controlling style and prosody. To address both, we propose the Hieratron framework, in which prosody and timbre are modeled separately by two modules, so that timbre and the other characteristics of the audio can be controlled independently when generating speech. In practice, with very limited in-the-wild data from the target speaker, Hieratron shows clear advantages over the traditional method: in addition to allowing control of the style and language of the generated speech, it improves the mean opinion score on speech quality by more than 0.2 points.
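The two-module separation described in the abstract can be illustrated with a toy sketch. This is not the Hieratron implementation: the abstract does not specify the intermediate representation or the module internals, so every name below (`ProsodyFrame`, `prosody_module`, `timbre_module`, the style labels, and the base-pitch parameter) is a hypothetical stand-in. The only point it shows is structural: the first stage never sees the speaker, and the second stage never sees the style, so the two can be varied independently.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProsodyFrame:
    """Hypothetical speaker-independent intermediate unit."""
    symbol: str         # what is said
    duration: int       # how long it lasts, in frames
    pitch_scale: float  # relative pitch movement, not an absolute voice pitch


def prosody_module(text: str, style: str) -> List[ProsodyFrame]:
    """Stage 1 (toy): text + style -> intermediate. Takes no speaker input."""
    slow = style == "calm"  # assumed style label, for illustration only
    return [
        ProsodyFrame(ch, 3 if slow else 2, 0.9 if slow else 1.1)
        for ch in text
        if not ch.isspace()
    ]


def timbre_module(frames: List[ProsodyFrame], base_pitch_hz: float) -> List[float]:
    """Stage 2 (toy): intermediate + speaker -> per-frame pitch track.

    The speaker is reduced to a single base pitch here; a real system
    would use a learned speaker representation and emit acoustic features.
    """
    track: List[float] = []
    for frame in frames:
        track.extend([base_pitch_hz * frame.pitch_scale] * frame.duration)
    return track


# Same prosody, two different (hypothetical) speakers.
calm = prosody_module("hi there", "calm")
speaker_a = timbre_module(calm, 120.0)
speaker_b = timbre_module(calm, 210.0)

# Same speaker, different style.
lively = prosody_module("hi there", "lively")
speaker_a_lively = timbre_module(lively, 120.0)
```

Swapping the speaker changes the values of the output track but not its timing, while swapping the style changes the timing for the same speaker. In this sketch, that is all "independent control" means.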

Taxonomy

Topics: Speech Recognition and Synthesis