FPETS : Fully Parallel End-to-End Text-to-Speech System
Dabiao Ma, Zhiba Su, Wenxuan Wang, Yuhao Lu

TL;DR
FPETS is a fully parallel, non-autoregressive end-to-end TTS system that significantly speeds up speech synthesis while maintaining or improving quality, addressing latency and error issues of previous models.
Contribution
Introduces FPETS, the first fully parallel end-to-end TTS system utilizing UFANS and a new alignment model for faster and more accurate speech synthesis.
Findings
FPETS is 600X faster than Tacotron2 at inference.
Generates speech of equal or better quality.
Reduces errors such as repeated words, mispronunciations, and skipped words.
Abstract
End-to-end text-to-speech (TTS) systems can greatly improve the quality of synthesised speech, but they usually suffer from high latency due to their auto-regressive structure. The synthesised speech may also suffer from error modes such as repeated words, mispronunciations, and skipped words. In this paper, we propose a novel non-autoregressive, fully parallel end-to-end TTS system (FPETS). It utilizes a new alignment model and the recently proposed U-shaped convolutional structure, UFANS. Unlike an RNN, UFANS can capture long-term information in a fully parallel manner. A trainable position encoding and a two-step training strategy are used to learn better alignments. Experimental results show that FPETS exploits parallel computation and achieves a significant inference speed-up compared with state-of-the-art end-to-end TTS systems. More specifically, FPETS is…
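The abstract's contrast between a recurrent (auto-regressive) structure and a fully parallel convolutional one can be illustrated with a toy sketch. This is not the FPETS or UFANS architecture: the function names, weights, and kernel below are hypothetical and chosen only to show why a recurrence must run step by step, while each convolution output depends only on the inputs and could therefore be computed for all positions at once.

```python
def recurrent_outputs(xs, w_in=0.5, w_rec=0.5):
    """Toy recurrence: step t needs the hidden state from step t-1."""
    h = 0.0
    outputs = []
    for x in xs:  # inherently sequential: cannot start step t before t-1 finishes
        h = w_in * x + w_rec * h
        outputs.append(h)
    return outputs


def conv_output_at(xs, t, kernel=(0.25, 0.5, 0.25)):
    """Toy 1-D convolution (zero padded) at a single position t.

    The result depends only on the inputs around t, not on other outputs.
    """
    half = len(kernel) // 2
    total = 0.0
    for i, k in enumerate(kernel):
        j = t + i - half
        if 0 <= j < len(xs):  # positions outside the sequence count as zero
            total += k * xs[j]
    return total


def conv_outputs(xs, kernel=(0.25, 0.5, 0.25)):
    """All positions; each call is independent, so the order does not matter."""
    return [conv_output_at(xs, t, kernel) for t in range(len(xs))]


if __name__ == "__main__":
    print(recurrent_outputs([1.0, 1.0, 1.0]))
    print(conv_outputs([1.0, 2.0, 3.0]))
```

In the recurrent case the number of sequential steps grows with the sequence length, whereas the convolutional outputs have no dependence on one another, which is the property a parallel device can exploit.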
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
