Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding
Zijian Lin, Yang Zhang, Yougen Yuan, Yuming Yan, Jinjiang Liu, Zhiyong Wu, Pengfei Hu, Qun Yu

TL;DR
This paper introduces Speech Speculative Decoding (SSD), a framework that accelerates autoregressive speech synthesis by using a lightweight draft model and parallel verification, achieving 1.4x faster inference without sacrificing quality.
Contribution
The paper proposes SSD, a speculative decoding framework for autoregressive speech synthesis in which a lightweight draft model generates candidate token sequences and the target model verifies them in parallel.
Findings
Achieves a 1.4x inference speedup over conventional autoregressive decoding
Maintains high fidelity and naturalness in synthesized speech
Subjective evaluations confirm perceptual quality preservation
Abstract
Modern autoregressive speech synthesis models leveraging language models have demonstrated remarkable performance. However, the sequential nature of next token prediction in these models leads to significant latency, hindering their deployment in scenarios where inference speed is critical. In this work, we propose Speech Speculative Decoding (SSD), a novel framework for autoregressive speech synthesis acceleration. Specifically, our method employs a lightweight draft model to generate candidate token sequences, which are subsequently verified in parallel by the target model using the proposed SSD framework. Experimental results demonstrate that SSD achieves a significant speedup of 1.4x compared with conventional autoregressive decoding, while maintaining high fidelity and naturalness. Subjective evaluations further validate the effectiveness of SSD in preserving the perceptual quality…
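The draft-then-verify loop described in the abstract can be sketched as follows. This is a minimal illustration of generic speculative decoding with a greedy exact-match acceptance rule, not the paper's SSD acceptance criterion, which this excerpt does not specify. All names (`speculative_decode`, `greedy_decode`, `toy_target`, `toy_draft`) and the toy token rules are hypothetical. The verification step loops here for clarity; a real target model would score all candidate positions in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prefix: List[int], num_new: int) -> List[int]:
    """Conventional autoregressive decoding: one target call per new token."""
    out = list(prefix)
    for _ in range(num_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prefix: List[int], num_new: int, k: int = 4) -> List[int]:
    """Draft-then-verify decoding with exact-match acceptance (an assumption).

    With this rule the result is identical to greedy_decode(target, ...).
    """
    out = list(prefix)
    goal = len(prefix) + num_new
    while len(out) < goal:
        # 1) The draft model proposes up to k candidate tokens sequentially.
        cand: List[int] = []
        for _ in range(min(k, goal - len(out))):
            cand.append(draft(out + cand))
        # 2) The target model predicts the token at every candidate position.
        #    (Toy loop; a real model would do this in a single parallel pass.)
        checks = [target(out + cand[:i]) for i in range(len(cand))]
        # 3) Keep candidates while they match; at the first mismatch, keep the
        #    target's own token instead and discard the remaining candidates.
        for c, t in zip(cand, checks):
            out.append(t)
            if c != t:
                break
    return out


def toy_target(seq: List[int]) -> int:
    """Toy stand-in for the target model (hypothetical rule)."""
    return (3 * seq[-1] + 1) % 17


def toy_draft(seq: List[int]) -> int:
    """Toy draft model: agrees with the target except at every third position."""
    t = (3 * seq[-1] + 1) % 17
    return t if len(seq) % 3 else (t + 1) % 17
```

Calling `speculative_decode(toy_target, toy_draft, [1], 20)` returns the same sequence as `greedy_decode(toy_target, [1], 20)`. Under the exact-match rule the output never changes; only the number of sequential target passes does. Any speedup therefore depends on how often the draft's candidates are accepted and on how cheap the draft model is relative to the target.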
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques
Methods: Convolution · Non Maximum Suppression · 1x1 Convolution · SSD · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
