MultiSpeech: Multi-Speaker Text to Speech with Transformer

Mingjian Chen; Xu Tan; Yi Ren; Jin Xu; Hao Sun; Sheng Zhao; Tao Qin; Tie-Yan Liu

arXiv:2006.04664 · eess.AS · August 4, 2020 · 30 cites

TL;DR

MultiSpeech is a robust multi-speaker Transformer TTS system that improves text-to-speech alignment and quality, enabling fast inference and high-quality multi-speaker synthesis even with noisy data.

Contribution

The paper introduces novel techniques to enhance Transformer-based multi-speaker TTS, achieving better alignment, quality, and inference speed compared to previous models.

Findings

01 Outperforms naive Transformer TTS in quality and robustness

02 Enables fast inference with a teacher-student training approach

03 Effective on VCTK and LibriTTS datasets

Abstract

Transformer-based text-to-speech (TTS) models (e.g., Transformer TTS [Li et al., 2019], FastSpeech [Ren et al., 2019]) have shown advantages in training and inference efficiency over RNN-based models (e.g., Tacotron [Shen et al., 2018]) due to their parallel computation in training and/or inference. However, parallel computation makes it harder to learn the alignment between text and speech in Transformer; this difficulty is further magnified in the multi-speaker scenario with noisy data and diverse speakers, and it hinders the applicability of Transformer to multi-speaker TTS. In this paper, we develop a robust and high-quality multi-speaker Transformer TTS system called MultiSpeech, with several specially designed components/techniques to improve text-to-speech alignment: 1) a diagonal constraint on the weight matrix of encoder-decoder attention in both training and…
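
The diagonal constraint named in the abstract can be illustrated with a minimal sketch. The abstract excerpt above does not give the formula, so everything below is an assumption made for illustration: the function names, the `bandwidth` parameter, the (frames, tokens) shape convention, and the choice to measure the share of attention mass that falls inside a band around the diagonal. It is one plausible way to encourage diagonal encoder-decoder attention, not necessarily the authors' exact loss.

```python
import numpy as np


def diagonal_attention_rate(attn, bandwidth):
    """Share of attention mass lying within `bandwidth` tokens of the diagonal.

    attn: array of shape (n_frames, n_tokens); row i is assumed to be the
          attention distribution of decoder frame i over the text tokens.
    bandwidth: hypothetical half-width of the diagonal band, in tokens.
    """
    n_frames, n_tokens = attn.shape
    frames = np.arange(n_frames)[:, None]
    tokens = np.arange(n_tokens)[None, :]
    # Token position that an ideal monotonic alignment reaches at each frame.
    slope = n_tokens / n_frames
    in_band = np.abs(tokens - slope * frames) <= bandwidth
    return float((attn * in_band).sum() / attn.sum())


def diagonal_constraint_loss(attn, bandwidth):
    # Minimizing the negative rate moves attention mass into the band.
    return -diagonal_attention_rate(attn, bandwidth)
```

In training, a term like this would be added to the usual reconstruction loss with a weighting factor; the weight and the band width are tuning choices that this sketch leaves open.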

Code & Models

Repositories

msalhab96/MultiSpeech (PyTorch)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Dropout · Byte Pair Encoding