Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis

Neeraj Kumar; Srishti Goel; Ankur Narang; Brejesh Lall

arXiv:2012.07252 · eess.AS · December 15, 2020 · 5 cites

TL;DR

This paper introduces FSM-SS, a novel few-shot multi-speaker speech synthesis method that uses adaptive normalization and multi-head attention to generate personalized speech styles from limited reference samples.

Contribution

It proposes a new adaptive normalization-based architecture for few-shot multi-speaker speech synthesis that effectively captures prosody and enables style transfer and morphing.

Findings

01. Achieves high-quality speech synthesis with minimal reference data.

02. Effectively captures prosodic features like energy and pitch.

03. Demonstrates superior performance on VCTK and LibriTTS datasets.

Abstract

The style of the speech varies from person to person and every person exhibits his or her own style of speaking that is determined by the language, geography, culture and other factors. Style is best captured by prosody of a signal. High quality multi-speaker speech synthesis while considering prosody and in a few shot manner is an area of active research with many real-world applications. While multiple efforts have been made in this direction, it remains an interesting and challenging problem. In this paper, we present a novel few shot multi-speaker speech synthesis approach (FSM-SS) that leverages adaptive normalization architecture with a non-autoregressive multi-head attention model. Given an input text and a reference speech sample of an unseen person, FSM-SS can generate speech in that person's style in a few shot manner. Additionally, we demonstrate how the affine parameters of…
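As a rough illustration of what "adaptive normalization" generally refers to, the sketch below normalizes each feature channel over time and then applies a scale and shift that are predicted from a style vector instead of being fixed learned constants. This is not the FSM-SS architecture itself: the function name `adaptive_norm`, the projection matrices `w_gamma` and `w_beta`, and the toy shapes are all assumptions made for illustration, and the abstract does not specify how the paper computes its affine parameters.

```python
import math


def adaptive_norm(features, style, w_gamma, w_beta, eps=1e-5):
    """Normalize each channel over time, then apply style-dependent affine parameters.

    features: C x T nested list (channels x time steps) of floats.
    style:    length-S list, a hypothetical embedding of a reference utterance.
    w_gamma, w_beta: C x S nested lists, hypothetical linear projections
                     that map the style vector to a per-channel scale and shift.
    """
    out = []
    for c, channel in enumerate(features):
        # Per-channel statistics over the time axis (instance-norm style).
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        std = math.sqrt(var + eps)

        # Affine parameters come from the style vector (assumption: linear map,
        # with the scale centered at 1 so a zero projection leaves it unchanged).
        gamma = 1.0 + sum(w * s for w, s in zip(w_gamma[c], style))
        beta = sum(w * s for w, s in zip(w_beta[c], style))

        out.append([gamma * (x - mean) / std + beta for x in channel])
    return out


# Toy usage: two channels, four time steps, a two-dimensional style vector.
feats = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.5, 9.5, 10.0]]
style = [0.5, -1.0]
zero = [[0.0, 0.0], [0.0, 0.0]]
shift = [[2.0, 0.0], [0.0, 1.0]]  # yields beta = 1.0 and beta = -1.0

plain = adaptive_norm(feats, style, zero, zero)    # ordinary normalization
styled = adaptive_norm(feats, style, zero, shift)  # style-dependent shift
```

With zero projections the function reduces to plain per-channel normalization; a non-zero projection moves each channel's statistics in a way that depends only on the style vector, which is the general idea behind conditioning a synthesizer on a reference sample.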

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention