Cloning one's voice using very limited data in the wild
Dongyang Dai, Yuanzhe Chen, Li Chen, Ming Tu, Lu Liu, Rui Xia, Qiao Tian, Yuping Wang, Yuxuan Wang

TL;DR
This paper introduces the Hieratron model for voice cloning, which works with very limited target-speaker data and lets timbre be controlled independently of style and prosody, improving speech quality and flexibility.
Contribution
The paper presents a novel Hieratron framework that models prosody and timbre separately, enabling high-quality voice cloning with minimal data and style control.
Findings
Hieratron outperforms traditional methods on limited data
Speech quality improved by over 0.2 points in MOS
Achieves control of timbre independent of style and prosody
Abstract
With the increasing popularity of speech synthesis products, the industry has put forward more requirements for personalized speech synthesis: (1) how to clone a person's voice from low-resource, easily accessible data, and (2) how to clone a person's voice while controlling the style and prosody. To address these two problems, we propose the Hieratron model framework, in which prosody and timbre are modeled separately by two modules, so that timbre can be controlled independently of the other characteristics of the audio when generating speech. In practice, with very limited target-speaker data collected in the wild, Hieratron shows clear advantages over the traditional method: in addition to allowing control of the style and language of the generated speech, it improves the mean opinion score (MOS) for speech quality by more than 0.2 points.
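The two-module split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `ProsodyFrame`, `prosody_module`, `timbre_module`, and `synthesize`, the style table, and the use of a single base-pitch number to stand in for a speaker's timbre are all hypothetical, chosen only to show how a speaker-independent first stage and a speaker-dependent second stage might be composed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProsodyFrame:
    """Hypothetical speaker-independent description of one spoken token."""
    token: str
    duration: int        # number of output frames for this token
    pitch_scale: float   # relative pitch; 1.0 means neutral


# Hypothetical style table; a real system would learn this from data.
STYLE_PITCH = {"neutral": 1.0, "excited": 1.25}


def prosody_module(text: str, style: str) -> List[ProsodyFrame]:
    """Stage 1 (toy): decide how the text is spoken, with no speaker information."""
    scale = STYLE_PITCH[style]
    return [ProsodyFrame(tok, len(tok), scale) for tok in text.split()]


def timbre_module(frames: List[ProsodyFrame], speaker_base_pitch: float) -> List[float]:
    """Stage 2 (toy): render the prosody frames in the target speaker's voice."""
    contour: List[float] = []
    for frame in frames:
        contour.extend([speaker_base_pitch * frame.pitch_scale] * frame.duration)
    return contour


def synthesize(text: str, style: str, speaker_base_pitch: float) -> List[float]:
    """Compose the two stages; style and speaker are chosen independently."""
    return timbre_module(prosody_module(text, style), speaker_base_pitch)
```

Because the first stage never sees the speaker, the same prosody frames can be rendered for different speakers, and a single speaker can be rendered in different styles. In this sketch only the second stage depends on the target speaker, which is one plausible reading of why such a split could help when target-speaker data is very limited.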
Taxonomy
Topics: Speech Recognition and Synthesis
