Speaker Generation
Daisy Stanton, Matt Shannon, Soroosh Mariooryad, RJ Skerry-Ryan, Eric Battenberg, Tom Bagby, David Kao

TL;DR
This paper introduces TacoSpawn, a recurrent attention-based text-to-speech system that generates diverse, human-sounding voices for nonexistent speakers, and proposes objective metrics that correlate with human perception of speaker similarity.
Contribution
It presents TacoSpawn, a novel speaker generation model that learns a speaker embedding distribution without transfer learning, enabling diverse voice synthesis.
Findings
TacoSpawn performs competitively on speaker generation tasks.
The proposed objective metrics correlate with human perception of speaker similarity.
The system is easy to implement and requires no transfer learning from speaker ID systems.
Abstract
This work explores the task of synthesizing speech in nonexistent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to implement, and does not require transfer learning from speaker ID systems. We present objective and subjective metrics for evaluating performance on this task, and demonstrate that our proposed objective metrics correlate with human perception of speaker similarity. Audio samples are available on our demo page.
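As a rough illustration of what "learning a distribution over a speaker embedding space" and then "sampling novel speakers" could look like, the sketch below draws one embedding from a diagonal-covariance Gaussian mixture. The abstract does not state the form of the distribution, so the mixture prior, the function name `sample_speaker_embedding`, the parameter names, and all sizes and values here are assumptions made for illustration. In a real system the mixture parameters would be learned during training, not drawn at random as they are here.

```python
import numpy as np


def sample_speaker_embedding(weights, means, log_stds, rng):
    """Draw one embedding from a diagonal-covariance Gaussian mixture.

    Hypothetical helper: the mixture form is an assumption, not something
    the abstract specifies.
    """
    # Pick a mixture component according to the mixture weights.
    k = rng.choice(len(weights), p=weights)
    # Sample from that component: mean plus scaled standard-normal noise.
    noise = rng.standard_normal(means.shape[1])
    return means[k] + np.exp(log_stds[k]) * noise


rng = np.random.default_rng(0)
num_components, dim = 4, 8  # illustrative sizes only

# Placeholder parameters standing in for learned values.
weights = np.full(num_components, 1.0 / num_components)
means = rng.standard_normal((num_components, dim))
log_stds = np.full((num_components, dim), -1.0)

# One novel "speaker": a vector that a TTS decoder would be conditioned on.
new_speaker = sample_speaker_embedding(weights, means, log_stds, rng)
```

Each call yields a different embedding, which is where the diversity of generated voices would come from under this assumption. Conditioning the text-to-speech model on the sampled vector is outside the scope of the sketch.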
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Music and Audio Processing
