FPETS : Fully Parallel End-to-End Text-to-Speech System

Dabiao Ma; Zhiba Su; Wenxuan Wang; Yuhao Lu

arXiv:1812.05710 · eess.AS · February 11, 2020 · 1 cites


TL;DR

FPETS is a fully parallel, non-autoregressive end-to-end TTS system that significantly speeds up speech synthesis while maintaining or improving quality, addressing latency and error issues of previous models.

Contribution

Introduces FPETS, the first fully parallel end-to-end TTS system utilizing UFANS and a new alignment model for faster and more accurate speech synthesis.

Findings

01 FPETS is 600X faster than Tacotron2.

02 Generates speech with equal or better quality.

03 Reduces errors like mispronunciations and skipped words.

Abstract

End-to-end text-to-speech (TTS) systems can greatly improve the quality of synthesised speech, but they usually suffer from high latency due to their autoregressive structure. The synthesised speech may also exhibit error modes such as repeated words, mispronunciations, and skipped words. In this paper, we propose a novel non-autoregressive, fully parallel end-to-end TTS system (FPETS). It uses a new alignment model and the recently proposed U-shaped convolutional structure, UFANS. Unlike an RNN, UFANS can capture long-term information in a fully parallel manner. Trainable position encoding and a two-step training strategy are used to learn better alignments. Experimental results show that FPETS exploits parallel computation and achieves a significant inference speed-up compared with state-of-the-art end-to-end TTS systems. More specifically, FPETS is…
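The contrast the abstract draws between autoregressive and fully parallel synthesis can be illustrated with a toy sketch. This is not the FPETS or UFANS implementation: the function names, the 0.5 feedback weight, and the three-frame averaging window (standing in for a convolutional receptive field) are all assumptions chosen for illustration. It only shows why frames that depend on previous outputs must be computed in order, while frames that depend only on the input and their position can be dispatched together.

```python
from concurrent.futures import ThreadPoolExecutor


def autoregressive_decode(inputs):
    """Toy autoregressive decoder: frame t needs frame t-1, so steps run in order."""
    frames = []
    prev = 0.0
    for x in inputs:
        # Hypothetical recurrence: each output feeds into the next step.
        prev = 0.5 * prev + x
        frames.append(prev)
    return frames


def parallel_frame(args):
    """Toy non-autoregressive frame: depends only on the inputs and the position."""
    position, inputs = args
    # A fixed local window stands in for a convolutional receptive field (assumption).
    lo, hi = max(0, position - 1), min(len(inputs), position + 2)
    window = inputs[lo:hi]
    return sum(window) / len(window)


def parallel_decode(inputs, workers=4):
    """No frame reads another frame's output, so all frames can be submitted at once."""
    jobs = [(t, inputs) for t in range(len(inputs))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the submitted jobs.
        return list(pool.map(parallel_frame, jobs))


if __name__ == "__main__":
    print(autoregressive_decode([1.0, 1.0, 1.0]))
    print(parallel_decode([1.0, 2.0, 3.0]))
```

The thread pool here only demonstrates that the frames are independent of one another; it is not a claim about speed in Python. In a real system the same independence is what lets a convolutional model compute every output frame in one batched pass on parallel hardware.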


Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
