ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech

Yi Ren; Ming Lei; Zhiying Huang; Shiliang Zhang; Qian Chen; Zhijie Yan; Zhou Zhao

arXiv:2202.07816 · eess.AS · February 17, 2022

TL;DR

ProsoSpeech improves expressive text-to-speech by using quantized latent vectors pre-trained on large-scale unpaired data to better model prosody attributes like pitch, duration, and energy.

Contribution

It introduces a novel quantized latent vector approach and a word-level prosody encoder trained on unpaired data for enhanced prosody modeling in TTS.

Findings

1. ProsoSpeech produces speech with richer prosody than baselines.

2. Pre-training on large-scale unpaired data improves prosody modeling.

3. Quantized latent vectors effectively capture prosody attributes.

Abstract

Expressive text-to-speech (TTS) has become a hot research topic recently, mainly focusing on modeling prosody in speech. Prosody modeling has several challenges: 1) the extracted pitch used in previous prosody modeling works has inevitable errors, which hurt prosody modeling; 2) different attributes of prosody (e.g., pitch, duration, and energy) depend on each other and together produce natural prosody; and 3) due to the high variability of prosody and the limited amount of high-quality data for TTS training, the distribution of prosody cannot be fully shaped. To tackle these issues, we propose ProsoSpeech, which enhances prosody using quantized latent vectors pre-trained on large-scale unpaired and low-quality text and speech data. Specifically, we first introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody…
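The quantization step mentioned in the abstract can be illustrated with a generic nearest-neighbor codebook lookup. This is a minimal sketch of vector quantization in general, not the authors' implementation: the function names, the toy two-dimensional codebook, and the per-word input vectors are all assumptions made for illustration, and the abstract as shown here does not specify the codebook size, the feature dimensions, or how the codebook is learned.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    vectors: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each input vector to its nearest codebook entry.

    Returns the chosen codebook indices and the corresponding
    quantized vectors (copies of the selected codebook entries).
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for v in vectors:
        # Pick the codebook row with the smallest squared distance to v.
        idx = min(
            range(len(codebook)),
            key=lambda k: squared_distance(v, codebook[k]),
        )
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


# Hypothetical toy data: three codebook entries and three
# word-level vectors (one per word), each of dimension 2.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
word_vectors = [[0.9, 1.2], [-0.8, 1.7], [0.1, -0.2]]

indices, quantized = quantize(word_vectors, codebook)
print(indices)
print(quantized)
```

Each word-level vector is replaced by a discrete index into the codebook, which is what makes the resulting prosody representation "quantized". In a trained model the codebook entries would be learned jointly with the encoder rather than fixed by hand as they are here.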

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Phonetics and Phonology Research