On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis
Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian, Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David Cox, James Glass

TL;DR
This paper investigates how pruning affects end-to-end text-to-speech models. It finds that heavily pruned models can match or even exceed the originals in naturalness and intelligibility while keeping similar prosody, as supported by subjective and objective evaluations.
Contribution
It demonstrates that end-to-end TTS models are highly prunable without loss of quality, and examines three aspects of pruning: the amount of finetuning data relative to sparsity, TTS-Augmentation to make use of unspoken text, and the combination of knowledge distillation with pruning.
Findings
Pruned TTS models can match or surpass original models in naturalness.
Pruning does not necessarily degrade speech quality, and can sometimes enhance it.
The study provides a large dataset of pruned models and evaluation results.
Abstract
Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point to explore pruning both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech. Additionally, we explored several aspects of TTS pruning: amount of finetuning data versus sparsity, TTS-Augmentation to utilize unspoken text, and combining knowledge distillation and pruning. Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility, with similar prosody. All of our experiments are conducted on publicly available models, and findings in this work are backed…
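To make the notion of sparsity concrete, the sketch below zeroes out a chosen fraction of the smallest-magnitude weights in a flat list. It is only an illustration: the text above does not state which pruning criterion the authors use, so magnitude-based pruning is an assumption here, and the names `magnitude_prune` and `measured_sparsity` are hypothetical helpers, not part of any released code.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Assumption: magnitude-based pruning, used here purely for illustration.
    `weights` is a flat list of floats; returns (pruned_weights, mask).
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(round(sparsity * len(weights)))
    # Rank indices by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    # mask[i] == 1 means the weight is kept, 0 means it is pruned.
    mask = [0 if i in pruned_idx else 1 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


def measured_sparsity(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0) / len(weights)


# Toy weight vector standing in for one layer of a TTS model.
weights = [0.8, -0.05, 0.3, -1.2, 0.01, 0.6, -0.02, 0.4]
pruned, mask = magnitude_prune(weights, sparsity=0.5)
```

In practice the mask would typically be held fixed while the surviving weights are finetuned, which is where the amount of finetuning data discussed above comes into play.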
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
Methods: Pruning · Knowledge Distillation
