PortaSpeech: Portable and High-Quality Generative Text-to-Speech

Yi Ren; Jinglin Liu; Zhou Zhao

arXiv:2109.15166 · eess.AS · February 15, 2022 · 6 cites

TL;DR

PortaSpeech is a lightweight, high-quality TTS model that captures both prosody and fine mel-spectrogram detail, outperforming existing models while keeping a small model size and low memory footprint.

Contribution

The paper introduces PortaSpeech, a novel portable TTS architecture combining a lightweight VAE, flow-based post-net, grouped parameter sharing, and a mixture alignment mechanism for improved speech synthesis.
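As a rough illustration of two of the ingredients named above, the sketch below chains a few invertible affine coupling steps and lets consecutive steps within a group reuse one coupling network. It is a toy under stated assumptions, not the authors' implementation: the step count, the group count, the `make_coupling_net` stand-in, and all helper names are hypothetical, and the real model operates on mel-spectrograms with learned networks.

```python
import math

def make_coupling_net(scale, shift):
    """Hypothetical stand-in for a learned coupling network.
    Maps the untouched half to one (log_scale, bias) pair per element."""
    def net(xa):
        return [(math.tanh(scale * v), shift * v) for v in xa]
    return net

def coupling_forward(x, net):
    """Affine coupling: transform the second half conditioned on the first."""
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    yb = [b * math.exp(s) + t for b, (s, t) in zip(xb, net(xa))]
    # Swap halves so the next step transforms the other half.
    return yb + xa

def coupling_inverse(y, net):
    """Exact inverse of coupling_forward."""
    half = len(y) // 2
    yb, xa = y[:half], y[half:]
    xb = [(b - t) * math.exp(-s) for b, (s, t) in zip(yb, net(xa))]
    return xa + xb

# Assumed toy sizes: 4 flow steps split into 2 groups.
NUM_STEPS = 4
NUM_GROUPS = 2
STEPS_PER_GROUP = NUM_STEPS // NUM_GROUPS

# One coupling network per group, so only NUM_GROUPS networks exist
# instead of NUM_STEPS.
GROUP_NETS = [make_coupling_net(0.3 * (g + 1), 0.1 * (g + 1))
              for g in range(NUM_GROUPS)]

def net_for_step(k):
    """Steps in the same group share the same network object."""
    return GROUP_NETS[k // STEPS_PER_GROUP]

def flow_forward(x):
    assert len(x) % 2 == 0, "this sketch assumes an even-length input"
    for k in range(NUM_STEPS):
        x = coupling_forward(x, net_for_step(k))
    return x

def flow_inverse(y):
    for k in reversed(range(NUM_STEPS)):
        y = coupling_inverse(y, net_for_step(k))
    return y

x = [0.5, -1.0, 2.0, 0.25]
y = flow_forward(x)
x_rec = flow_inverse(y)  # recovers x up to floating-point error
```

In this toy setup, sharing cuts the number of distinct coupling networks from four to two while the flow stays exactly invertible, which is the general idea behind trading parameter count against depth.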

Findings

1. Outperforms other TTS models in voice quality and prosody.

2. Maintains high performance with only 6.7M parameters, about 4x smaller than FastSpeech 2.

3. Each design component in PortaSpeech is validated through ablation studies.

Abstract

Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS can synthesize high-quality speech from the given text in parallel. After analyzing two kinds of generative NAR-TTS models (VAE and normalizing flow), we find that: VAE is good at capturing the long-range semantics features (e.g., prosody) even with small model size but suffers from blurry and unnatural results; and normalizing flow is good at reconstructing the frequency bin-wise details but performs poorly when the number of model parameters is limited. Inspired by these observations, to generate diverse speech with natural details and rich prosody using a lightweight architecture, we propose PortaSpeech, a portable and high-quality generative text-to-speech model. Specifically, 1) to model both the prosody and mel-spectrogram details accurately, we adopt a lightweight VAE with an enhanced prior…

Code & Models

Videos

PortaSpeech: Portable and High-Quality Generative Text-to-Speech · SlidesLive

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Invertible 1x1 Convolution · Activation Normalization · Normalizing Flows · GLOW · Dense Connections · Layer Normalization · Position-Wise Feed-Forward Layer