ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis

Jinlong Xue; Yayue Deng; Yichen Han; Ya Li; Jianqing Sun; Jiaen Liang

arXiv:2203.10473 · cs.SD · March 29, 2022

TL;DR

This paper introduces an end-to-end multi-speaker TTS system using ECAPA-TDNN for improved speaker representation, resulting in higher speech quality and similarity for both seen and unseen speakers.

Contribution

It presents an architecture of three separately trained components (an ECAPA-TDNN speaker encoder, a FastSpeech2-based synthesizer, and a HiFi-GAN vocoder) and uses automatic MOS evaluation to assess speech quality.

Findings

01. Enhanced speaker similarity and naturalness over previous models

02. Effective automatic evaluation of speech quality using deep learning methods

03. Superior performance on both seen and unseen speakers

Abstract

In recent years, neural-network-based methods for multi-speaker text-to-speech (TTS) synthesis have made significant progress. However, the speaker encoder models used in these methods still fail to capture enough speaker information. In this paper, we focus on accurate speaker encoder modeling and propose an end-to-end method that generates high-quality speech with better speaker similarity for both seen and unseen speakers. The proposed architecture consists of three separately trained components: a speaker encoder based on the state-of-the-art ECAPA-TDNN model, which is derived from the speaker verification task; a FastSpeech2-based synthesizer; and a HiFi-GAN vocoder. A comparison among different speaker encoder models shows that our proposed method achieves better naturalness and similarity. To efficiently evaluate our synthesized speech, we are the first to adopt deep learning based…
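
The three-component data flow described in the abstract (reference audio to speaker embedding, text plus embedding to mel-spectrogram, mel-spectrogram to waveform) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: every function body is a toy stand-in for the real model, and the names `speaker_encoder`, `synthesizer`, `vocoder`, `synthesize`, as well as the constants `EMBED_DIM`, `N_MELS`, and `HOP`, are hypothetical choices made for this example.

```python
import math
from typing import List

# All sizes below are hypothetical values chosen for illustration only.
EMBED_DIM = 192   # length of the speaker embedding
N_MELS = 80       # mel bins per spectrogram frame
HOP = 256         # waveform samples produced per mel frame


def speaker_encoder(reference_wave: List[float]) -> List[float]:
    """Toy stand-in for the ECAPA-TDNN speaker encoder.

    Maps reference audio of any length to a fixed-size, unit-norm vector.
    """
    sums = [0.0] * EMBED_DIM
    for i, x in enumerate(reference_wave):
        sums[i % EMBED_DIM] += x          # crude pooling over time
    norm = math.sqrt(sum(v * v for v in sums)) or 1.0
    return [v / norm for v in sums]


def synthesizer(phonemes: List[str], spk: List[float],
                frames_per_phoneme: int = 5) -> List[List[float]]:
    """Toy stand-in for the FastSpeech2-based synthesizer.

    Produces mel frames conditioned on the text and the speaker embedding.
    """
    bias = sum(spk) / len(spk)            # speaker conditioning (toy)
    mel = []
    for p in phonemes:
        base = (sum(ord(c) for c in p) % 10) / 10.0
        for _ in range(frames_per_phoneme):
            mel.append([base + bias] * N_MELS)
    return mel


def vocoder(mel: List[List[float]]) -> List[float]:
    """Toy stand-in for the HiFi-GAN vocoder: mel frames to waveform."""
    wave: List[float] = []
    for frame in mel:
        level = sum(frame) / len(frame)
        wave.extend(level * math.sin(2 * math.pi * n / HOP) for n in range(HOP))
    return wave


def synthesize(phonemes: List[str], reference_wave: List[float]) -> List[float]:
    """Chain the three separately built components into one pipeline."""
    spk = speaker_encoder(reference_wave)
    mel = synthesizer(phonemes, spk)
    return vocoder(mel)


if __name__ == "__main__":
    reference = [math.sin(0.01 * n) for n in range(16000)]
    audio = synthesize(["HH", "AH", "L", "OW"], reference)
    print(len(audio), "samples generated")
```

Because the components only communicate through the embedding and the mel frames, each one could in principle be trained or swapped independently, which matches the "separately trained" design the abstract describes.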

Code & Models

Repositories

2023-MindSpore-1/ms-code-50
mindspore

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems

Methods: Dense Connections · Softmax · HiFi-GAN · Dropout · Linear Layer · Layer Normalization · Attention Is All You Need · Multi-Head Attention · Residual Connection