Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding
Bohan Li, Hankun Wang, Situo Zhang, Yiwei Guo, Kai Yu

TL;DR
VADUSA applies speculative decoding to auto-regressive TTS, using draft heads and a tolerance mechanism to significantly accelerate inference while maintaining high speech quality.
Contribution
The paper presents VADUSA, a speculative decoding method for auto-regressive TTS that uses draft heads and a tolerance mechanism to speed up inference and improve synthesis quality.
Findings
Significant inference speedup in TTS systems
Improved speech synthesis quality with speculative decoding
Effective generalization across datasets and speech types
Abstract
The auto-regressive architecture, like GPTs, is widely used in modern Text-to-Speech (TTS) systems. However, it incurs substantial inference time, particularly due to the challenges in the next-token prediction posed by lengthy sequences of speech tokens. In this work, we introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding. Our results show that VADUSA not only significantly improves inference speed but also enhances performance by incorporating draft heads to predict future speech content auto-regressively. Furthermore, the inclusion of a tolerance mechanism during sampling accelerates inference without compromising quality. Our approach demonstrates strong generalization across large datasets and various types of speech tokens.
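To make the idea of draft-then-verify with a tolerance concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's implementation: the function name `accept_prefix`, the integer `tolerance` budget, and the exact-match acceptance rule are all hypothetical choices made for this example. It assumes the draft heads have proposed a short run of future speech tokens and that the base model has scored all of those positions in one parallel pass, so `verified[i]` is the base model's own choice given the draft tokens before position `i`.

```python
from typing import Sequence


def accept_prefix(draft: Sequence[int],
                  verified: Sequence[int],
                  tolerance: int = 0) -> int:
    """Return how many drafted tokens to keep (hypothetical helper).

    draft[i]    : token proposed by an assumed draft head for position i
    verified[i] : token the base model picks at position i, conditioned
                  on the drafted tokens before it
    tolerance   : assumed budget of mismatches allowed inside the
                  accepted run; 0 gives strict exact-match verification
    """
    mismatches = 0
    accepted = 0
    for d, v in zip(draft, verified):
        if d != v:
            mismatches += 1
            if mismatches > tolerance:
                break  # budget exhausted: stop before this token
        accepted += 1
    return accepted


# Strict verification stops at the first disagreement.
strict = accept_prefix([1, 2, 3, 4], [1, 2, 9, 4], tolerance=0)

# With a budget of one mismatch, the whole draft is kept.
relaxed = accept_prefix([1, 2, 3, 4], [1, 2, 9, 4], tolerance=1)
```

In this toy setting a larger `tolerance` lengthens the accepted run per base-model pass, which is where the extra speedup would come from; whether quality is preserved depends on how the relaxed rule is defined, which this sketch does not attempt to reproduce.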
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
