PortaSpeech: Portable and High-Quality Generative Text-to-Speech
Yi Ren, Jinglin Liu, Zhou Zhao

TL;DR
PortaSpeech is a lightweight, high-quality TTS model that captures both prosody and mel-spectrogram details, outperforming other TTS models while keeping a small model size and low memory footprint.
Contribution
The paper introduces PortaSpeech, a novel portable TTS architecture combining a lightweight VAE, flow-based post-net, grouped parameter sharing, and a mixture alignment mechanism for improved speech synthesis.
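To make the idea of grouped parameter sharing concrete, the sketch below counts parameters for a stack of flow steps when several consecutive steps reuse one coupling network. It is only an illustration of the general idea: the function name `coupling_param_count`, the step count, the per-network size, and the group size are all hypothetical and are not taken from the paper.

```python
def coupling_param_count(num_steps: int, params_per_net: int, group_size: int = 1) -> int:
    """Count coupling-network parameters when every `group_size` consecutive
    flow steps reuse one network (group_size=1 means no sharing)."""
    if group_size < 1 or num_steps % group_size != 0:
        raise ValueError("num_steps must be a positive multiple of group_size")
    num_groups = num_steps // group_size  # one network is stored per group
    return num_groups * params_per_net


# Hypothetical sizes chosen only for illustration; not figures from the paper.
unshared = coupling_param_count(num_steps=12, params_per_net=500_000)
shared = coupling_param_count(num_steps=12, params_per_net=500_000, group_size=3)
print(unshared, shared, unshared / shared)
```

Under these assumed numbers, sharing one network across each group of three steps stores a third of the coupling parameters while the number of flow steps stays the same.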
Findings
Outperforms other TTS models in both voice quality and prosody modeling.
Shows only slight performance degradation when reduced to 6.7M parameters, about 4x smaller than FastSpeech 2.
Each design component in PortaSpeech is validated through ablation studies.
Abstract
Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS can synthesize high-quality speech from the given text in parallel. After analyzing two kinds of generative NAR-TTS models (VAE and normalizing flow), we find that: VAE is good at capturing the long-range semantics features (e.g., prosody) even with small model size but suffers from blurry and unnatural results; and normalizing flow is good at reconstructing the frequency bin-wise details but performs poorly when the number of model parameters is limited. Inspired by these observations, to generate diverse speech with natural details and rich prosody using a lightweight architecture, we propose PortaSpeech, a portable and high-quality generative text-to-speech model. Specifically, 1) to model both the prosody and mel-spectrogram details accurately, we adopt a lightweight VAE with an enhanced prior…
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Invertible 1x1 Convolution · Activation Normalization · Normalizing Flows · GLOW · Dense Connections · Layer Normalization · Position-Wise Feed-Forward Layer
