ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis
Jinlong Xue, Yayue Deng, Yichen Han, Ya Li, Jianqing Sun, Jiaen Liang

TL;DR
This paper introduces an end-to-end multi-speaker TTS system that uses ECAPA-TDNN for improved speaker representation, resulting in higher speech quality and speaker similarity for both seen and unseen speakers.
Contribution
It presents a novel multi-component architecture combining ECAPA-TDNN, FastSpeech2, and HiFi-GAN, with automatic MOS evaluation for speech quality assessment.
Findings
Enhanced speaker similarity and naturalness over previous models
Effective automatic evaluation of speech quality using deep learning methods
Superior performance on both seen and unseen speakers
Abstract
In recent years, neural-network-based methods for multi-speaker text-to-speech synthesis (TTS) have made significant progress. However, the speaker encoder models currently used in these methods still cannot capture enough speaker information. In this paper, we focus on accurate speaker encoder modeling and propose an end-to-end method that can generate high-quality speech with better speaker similarity for both seen and unseen speakers. The proposed architecture consists of three separately trained components: a speaker encoder based on the state-of-the-art ECAPA-TDNN model, which is derived from the speaker verification task; a FastSpeech2-based synthesizer; and a HiFi-GAN vocoder. A comparison among different speaker encoder models shows that our proposed method achieves better naturalness and similarity. To efficiently evaluate our synthesized speech, we are the first to adopt deep learning based…
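The three-component data flow described in the abstract can be sketched as follows. This is only a toy illustration of how a speaker embedding, a synthesizer, and a vocoder connect, not the authors' implementation: the three functions are hypothetical stand-ins built from fixed random projections, and the embedding size (192), mel-bin count (80), hop size (256), phoneme vocabulary size (100), and fixed phoneme duration are all assumptions made for the example.

```python
import numpy as np

# All sizes below are assumptions for illustration, not values from the paper.
EMB_DIM = 192   # speaker embedding size
N_MELS = 80     # mel-spectrogram bins
HOP = 256       # waveform samples per mel frame


def speaker_encoder(reference_wav: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the ECAPA-TDNN speaker encoder.

    Maps a variable-length reference waveform to a fixed-size, unit-norm vector.
    """
    rng = np.random.default_rng(0)
    usable = len(reference_wav) // HOP * HOP
    frames = reference_wav[:usable].reshape(-1, HOP)
    # Pool over time so the output size does not depend on utterance length.
    stats = np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
    emb = rng.standard_normal((EMB_DIM, stats.size)) @ stats
    return emb / (np.linalg.norm(emb) + 1e-8)


def synthesizer(phoneme_ids, spk_emb: np.ndarray, frames_per_phoneme: int = 5) -> np.ndarray:
    """Hypothetical stand-in for the FastSpeech2-based synthesizer.

    Conditions every phoneme representation on the speaker embedding and
    expands each phoneme to a fixed number of mel frames.
    """
    rng = np.random.default_rng(1)
    phoneme_table = rng.standard_normal((100, N_MELS))  # toy vocabulary of 100 ids
    spk_proj = rng.standard_normal((EMB_DIM, N_MELS))
    hidden = phoneme_table[np.asarray(phoneme_ids)] + spk_emb @ spk_proj
    return np.repeat(hidden, frames_per_phoneme, axis=0)  # (frames, N_MELS)


def vocoder(mel: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the HiFi-GAN vocoder: mel frames to waveform."""
    frame_level = np.tanh(mel.mean(axis=1))
    return np.repeat(frame_level, HOP)  # HOP samples per mel frame


# Usage: one second of a 220 Hz tone plays the role of the reference utterance.
reference_wav = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
emb = speaker_encoder(reference_wav)
mel = synthesizer([3, 17, 42], emb)
audio = vocoder(mel)
print(emb.shape, mel.shape, audio.shape)
```

Because the speaker identity enters the synthesizer only through the embedding vector, an unseen speaker can in principle be handled by encoding a new reference utterance without retraining the other two components, which is the motivation for keeping the components separate.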
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
Methods: Dense Connections · Softmax · HiFi-GAN · Dropout · Linear Layer · Layer Normalization · Attention Is All You Need · Multi-Head Attention · Residual Connection
