TL;DR
This paper explores combining pretrained and learnable speaker representations in multi-speaker multi-style text-to-speech, demonstrating improved generalization and competitive performance in few-shot voice cloning tasks.
Contribution
It integrates pretrained and learnable speaker embeddings in a single model; among the pretrained embeddings compared, those pretrained by voice conversion yield the best results.
Findings
Embeddings pretrained by voice conversion outperform the other embedding types investigated.
The combined model generalizes well to few-shot speakers.
The system achieved 2nd place in the one-shot track of the ICASSP 2021 M2VoC challenge.
Abstract
The few-shot multi-speaker multi-style voice cloning task is to synthesize utterances with voice and speaking style similar to a reference speaker given only a few reference samples. In this work, we investigate different speaker representations and propose to integrate pretrained and learnable speaker representations. Among different types of embeddings, the embedding pretrained by voice conversion achieves the best performance. The FastSpeech 2 model combined with both pretrained and learnable speaker representations shows great generalization ability on few-shot speakers and achieved 2nd place in the one-shot track of the ICASSP 2021 M2VoC challenge.
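The abstract does not say how the pretrained and learnable representations are merged. The sketch below shows one plausible scheme under stated assumptions: the pretrained embedding is linearly projected to the model's hidden size, summed with a per-speaker vector from a learnable lookup table, and the result is added to every encoder frame. The class name `SpeakerConditioner`, its methods, the dimensions, and the random initialisation are all hypothetical and are not taken from the paper or from any FastSpeech 2 implementation.

```python
import random


class SpeakerConditioner:
    """Hypothetical sketch: pretrained + learnable speaker vectors, summed."""

    def __init__(self, num_speakers, pretrained_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Learnable table: one hidden_dim vector per speaker ID.
        # In a real model this would be trained jointly with the TTS network.
        self.table = [
            [rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
            for _ in range(num_speakers)
        ]
        # Learnable projection from the pretrained embedding space
        # (e.g. a frozen voice-conversion speaker encoder) to hidden_dim.
        self.proj = [
            [rng.gauss(0.0, 0.1) for _ in range(pretrained_dim)]
            for _ in range(hidden_dim)
        ]

    def speaker_vector(self, speaker_id, pretrained_emb):
        # Project the pretrained embedding, then add the learnable one.
        projected = [
            sum(w * x for w, x in zip(row, pretrained_emb))
            for row in self.proj
        ]
        learned = self.table[speaker_id]
        return [p + l for p, l in zip(projected, learned)]

    def condition(self, encoder_frames, speaker_id, pretrained_emb):
        # Broadcast the combined speaker vector over all encoder frames.
        spk = self.speaker_vector(speaker_id, pretrained_emb)
        return [[h + s for h, s in zip(frame, spk)] for frame in encoder_frames]


# Toy usage: 2 encoder frames, hidden size 2, pretrained embedding size 3.
cond = SpeakerConditioner(num_speakers=4, pretrained_dim=3, hidden_dim=2)
frames = [[0.0, 0.0], [1.0, 1.0]]
out = cond.condition(frames, speaker_id=1, pretrained_emb=[0.5, -0.2, 0.1])
```

Summation is only one option; concatenating the two vectors before a projection would be an equally reasonable assumption. In either case the intuition is that the pretrained vector carries information available for unseen speakers from a few reference samples, while the learnable table entry can be adapted per speaker.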
Taxonomy
Methods: Attention Is All You Need · Dense Connections · Layer Normalization · Residual Connection · Softmax · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Dropout
