Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech

Chung-Ming Chien; Jheng-Hao Lin; Chien-yu Huang; Po-chun Hsu; Hung-yi Lee

arXiv:2103.04088 · eess.AS · May 4, 2021

TL;DR

This paper explores combining pretrained and learnable speaker representations in multi-speaker multi-style text-to-speech, demonstrating improved generalization and competitive performance in few-shot voice cloning tasks.

Contribution

It introduces a novel integration of pretrained and learnable speaker embeddings, with voice conversion pretrained embeddings yielding the best results.

Findings

01

Pretrained voice conversion embeddings outperform other types.

02

The combined model generalizes well to few-shot speakers.

03

The combined model achieved 2nd place in the one-shot track of the ICASSP 2021 M2VoC challenge.

Abstract

The few-shot multi-speaker multi-style voice cloning task is to synthesize utterances with voice and speaking style similar to a reference speaker given only a few reference samples. In this work, we investigate different speaker representations and propose to integrate pretrained and learnable speaker representations. Among different types of embeddings, the embedding pretrained by voice conversion achieves the best performance. The FastSpeech 2 model combined with both pretrained and learnable speaker representations shows great generalization ability on few-shot speakers and achieved 2nd place in the one-shot track of the ICASSP 2021 M2VoC challenge.
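To make the idea of integrating the two kinds of speaker representation concrete, here is a minimal sketch in plain Python. The abstract does not say how the representations are combined, so everything below is an assumption for illustration: the class name `SpeakerConditioner`, the linear projection of the pretrained vector, the element-wise sum with a learnable lookup-table entry, and the broadcast onto encoder states are all hypothetical and may differ from the authors' model.

```python
import random


class SpeakerConditioner:
    """Hypothetical module (not from the paper): combines a fixed pretrained
    speaker vector with a trainable per-speaker embedding.
    Assumed combination: linear projection followed by an element-wise sum."""

    def __init__(self, n_speakers, pretrained_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Learnable lookup table: one hidden_dim vector per known speaker.
        self.table = [[rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
                      for _ in range(n_speakers)]
        # Learnable projection mapping pretrained_dim -> hidden_dim.
        self.proj = [[rng.gauss(0.0, 0.1) for _ in range(pretrained_dim)]
                     for _ in range(hidden_dim)]

    def speaker_vector(self, speaker_id, pretrained):
        # Project the pretrained embedding (e.g. from a voice conversion model).
        projected = [sum(w * x for w, x in zip(row, pretrained))
                     for row in self.proj]
        # Add the learnable embedding looked up by speaker index.
        learned = self.table[speaker_id]
        return [p + l for p, l in zip(projected, learned)]

    def condition(self, encoder_states, speaker_id, pretrained):
        # Add the same speaker vector to every encoder frame.
        spk = self.speaker_vector(speaker_id, pretrained)
        return [[h + s for h, s in zip(frame, spk)] for frame in encoder_states]


cond = SpeakerConditioner(n_speakers=3, pretrained_dim=4, hidden_dim=2)
states = [[0.0, 0.0], [1.0, 1.0]]          # two frames of toy encoder output
out = cond.condition(states, speaker_id=1, pretrained=[0.5, -0.2, 0.1, 0.3])
```

In a real system the table and projection would be trained jointly with the TTS model while the pretrained embedding extractor stays fixed; this sketch only shows the data flow under the stated assumptions.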

Code & Models

Repositories

ming024/FastSpeech2
PyTorch · Official

Taxonomy

Methods: Attention Is All You Need · Dense Connections · Layer Normalization · Residual Connection · Softmax · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Dropout