Decoding Order Matters in Autoregressive Speech Synthesis
Minghui Zhao, Anton Ragni

TL;DR
This paper explores how the order of decoding in autoregressive speech synthesis impacts quality, proposing a masked diffusion framework that enables flexible decoding orders and demonstrates the superiority of adaptive strategies over fixed ones.
Contribution
The paper introduces a masked diffusion approach that allows arbitrary decoding orders, and shows that adaptive decoding strategies outperform fixed orders, including left-to-right, in speech synthesis.
Findings
The degree of randomness in the decoding order affects speech quality.
Adaptive decoding strategies outperform fixed orders.
Even 1-bit quantisation of acoustic representations can support reasonably high-quality speech.
Abstract
Autoregressive speech synthesis often adopts a left-to-right order, yet generation order is a modelling choice. We investigate decoding order through a masked diffusion framework, which progressively unmasks positions and allows arbitrary decoding orders during training and inference. By interpolating between identity and random permutations, we show that randomness in decoding order affects speech quality. We further compare fixed strategies, such as l2r (left-to-right) and r2l (right-to-left), with adaptive ones, such as Top-K, finding that fixed-order decoding, including the dominant left-to-right approach, is suboptimal, while adaptive decoding yields better performance. Finally, since masked diffusion requires discrete inputs, we quantise acoustic representations and find that even 1-bit quantisation can support reasonably high-quality speech.
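To make the two kinds of decoding order in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It assumes one possible way to interpolate between the identity and a random permutation (shuffling a random subset of positions whose size is set by a hypothetical parameter `alpha`), and it assumes that the adaptive strategy unmasks the still-masked positions with the highest model confidence. The names `interpolated_order`, `top_k_step`, `decode_adaptive`, and `confidence_fn` are illustrative and do not come from the paper.

```python
import numpy as np


def interpolated_order(n, alpha, rng):
    """Return a permutation of range(n): identity at alpha=0, fully random at alpha=1.

    Assumption: a random subset of round(alpha * n) positions is shuffled among
    itself. The interpolation scheme used in the paper may differ.
    """
    order = np.arange(n)
    m = int(round(alpha * n))
    chosen = np.sort(rng.choice(n, size=m, replace=False))
    order[chosen] = rng.permutation(chosen)
    return order


def top_k_step(confidence, masked, k):
    """Pick the k still-masked positions with the highest confidence."""
    candidates = np.flatnonzero(masked)
    k = min(k, candidates.size)
    ranked = np.argsort(-confidence[candidates], kind="stable")
    return np.sort(candidates[ranked[:k]])


def decode_adaptive(confidence_fn, n, k):
    """Unmask all n positions, k per step, and return the order they were visited in.

    confidence_fn is a hypothetical stand-in for a model: it takes the current
    boolean mask and returns one confidence score per position.
    """
    masked = np.ones(n, dtype=bool)
    visited = []
    while masked.any():
        picked = top_k_step(confidence_fn(masked), masked, k)
        masked[picked] = False
        visited.extend(picked.tolist())
    return visited
```

With `alpha=0.0` the first function reproduces the fixed left-to-right order, and reversing that result gives right-to-left. The adaptive loop, by contrast, lets the scores decide which positions are revealed at each step, so the visiting order depends on the input rather than being set in advance.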
Taxonomy
Topics: Speech Recognition and Synthesis · Stochastic Gradient Optimization Techniques · Speech and Audio Processing
