Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding

Zijian Lin; Yang Zhang; Yougen Yuan; Yuming Yan; Jinjiang Liu; Zhiyong Wu; Pengfei Hu; Qun Yu

arXiv:2505.15380 · cs.SD · June 4, 2025

Open Access

TL;DR

This paper introduces Speech Speculative Decoding (SSD), a framework that accelerates autoregressive speech synthesis by using a lightweight draft model and parallel verification, achieving 1.4x faster inference without sacrificing quality.

Contribution

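The paper proposes SSD, a speculative decoding framework for autoregressive speech synthesis in which a lightweight draft model generates candidate token sequences and the target model verifies them in parallel, yielding a reported 1.4x speedup over conventional autoregressive decoding.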

Findings

01. Achieves a 1.4x inference speedup over conventional autoregressive decoding

02. Maintains high fidelity and naturalness in synthesized speech

03. Subjective evaluations confirm that perceptual quality is preserved

Abstract

Modern autoregressive speech synthesis models leveraging language models have demonstrated remarkable performance. However, the sequential nature of next token prediction in these models leads to significant latency, hindering their deployment in scenarios where inference speed is critical. In this work, we propose Speech Speculative Decoding (SSD), a novel framework for autoregressive speech synthesis acceleration. Specifically, our method employs a lightweight draft model to generate candidate token sequences, which are subsequently verified in parallel by the target model using the proposed SSD framework. Experimental results demonstrate that SSD achieves a significant speedup of 1.4x compared with conventional autoregressive decoding, while maintaining high fidelity and naturalness. Subjective evaluations further validate the effectiveness of SSD in preserving the perceptual quality…
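
The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is a generic speculative decoding loop, not the paper's SSD implementation: the acceptance rule used here (exact match against the target model's greedy choice) is an assumption, since this listing does not state how SSD decides which draft tokens to accept. The names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical, and the two "models" are toy functions over integer tokens standing in for real speech-token language models.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: a model maps a token context to its greedy next token.
Model = Callable[[List[Token]], Token]


def greedy_decode(target: Model, prompt: List[Token], max_new: int) -> List[Token]:
    """Conventional autoregressive decoding: one target call per new token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: Model, draft: Model, prompt: List[Token],
                       max_new: int, k: int = 4) -> List[Token]:
    """Draft up to k tokens, verify them with the target, keep the agreed prefix."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1) The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target scores every proposed position. A real system would do
        #    this in one batched forward pass; here it is a loop for clarity.
        verified = [target(out + proposal[:i]) for i in range(len(proposal))]
        # 3) Accept the longest prefix on which draft and target agree
        #    (assumed acceptance rule: exact greedy match).
        n = 0
        while n < len(proposal) and proposal[n] == verified[n]:
            n += 1
        out.extend(proposal[:n])
        # On the first disagreement, fall back to the target's own token,
        # so every round adds at least one token.
        if n < len(proposal):
            out.append(verified[n])
    return out


def toy_target(ctx: List[Token]) -> Token:
    # Toy stand-in for the large target model.
    return (ctx[-1] * 3 + 1) % 17


def toy_draft(ctx: List[Token]) -> Token:
    # Toy stand-in for the draft model: agrees with the target except
    # when the context length is a multiple of 5.
    guess = toy_target(ctx)
    return (guess + 1) % 17 if len(ctx) % 5 == 0 else guess


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [1], max_new=20))
```

Under this exact-match rule the output is identical to plain greedy decoding with the target model; any speedup comes from the target verifying several positions per pass instead of one. Systems that sample rather than decode greedily typically use a probabilistic acceptance test instead, which this sketch does not cover.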

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques

Methods: Convolution · Non Maximum Suppression · 1x1 Convolution · SSD · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings