On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis

Cheng-I Jeff Lai; Erica Cooper; Yang Zhang; Shiyu Chang; Kaizhi Qian; Yi-Lun Liao; Yung-Sung Chuang; Alexander H. Liu; Junichi Yamagishi; David Cox; James Glass

arXiv:2110.01147·cs.SD·October 29, 2021

Open Access

TL;DR

This paper investigates the effects of pruning on end-to-end text-to-speech models. It finds that heavily pruned models can match or even exceed the original models in naturalness and intelligibility while keeping similar prosody, a result supported by subjective and objective evaluations.

Contribution

The paper demonstrates that end-to-end TTS models are highly prunable without loss of quality, and examines three aspects of TTS pruning: the amount of finetuning data versus sparsity, TTS-Augmentation to utilize unspoken text, and combining knowledge distillation with pruning.

Findings

01

Pruned TTS models can match or surpass original models in naturalness.

02

Pruning does not necessarily degrade speech quality, and can sometimes enhance it.

03

The study provides a large dataset of pruned models and evaluation results.

Abstract

Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point to explore pruning both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech. Additionally, we explore several aspects of TTS pruning: amount of finetuning data versus sparsity, TTS-Augmentation to utilize unspoken text, and combining knowledge distillation and pruning. Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility, with similar prosody. All of our experiments are conducted on publicly available models, and findings in this work are backed…
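To make the notion of sparsity in the abstract concrete, here is a minimal sketch of unstructured magnitude pruning on a flat list of weights. The abstract does not say which pruning criterion the authors use, so magnitude pruning is an assumption chosen for illustration, and the names `magnitude_prune` and `measured_sparsity` are hypothetical helpers, not part of the paper's code.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    Hypothetical helper for illustration only; the pruning criterion
    (absolute magnitude) is an assumption, not taken from the paper.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    # Rank weight indices from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def measured_sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Toy weight vector standing in for one layer of a TTS model.
weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4, -0.2, 0.6]
pruned = magnitude_prune(weights, 0.5)
print(pruned)
print(measured_sparsity(pruned))
```

In a pruning study of this kind, a model pruned this way would typically be finetuned afterwards with the zeroed weights held fixed; that step is omitted from the sketch.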

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems

Methods: Pruning · Knowledge Distillation