Decoding Order Matters in Autoregressive Speech Synthesis

Minghui Zhao; Anton Ragni

arXiv:2601.08450·cs.SD·January 14, 2026

Open Access

TL;DR

This paper explores how the order of decoding in autoregressive speech synthesis impacts quality, proposing a masked diffusion framework that enables flexible decoding orders and demonstrates the superiority of adaptive strategies over fixed ones.

Contribution

The paper introduces a masked diffusion approach that allows arbitrary decoding orders, and shows that adaptive decoding strategies outperform the fixed left-to-right order in speech synthesis.

Findings

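01. The degree of randomness in the decoding order, varied by interpolating between identity and random permutations, affects speech quality.
02. Adaptive decoding strategies such as Top-K outperform fixed orders, including left-to-right and right-to-left.
03. Even 1-bit quantisation of the acoustic representations can support reasonably high-quality speech.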

Abstract

Autoregressive speech synthesis often adopts a left-to-right order, yet generation order is a modelling choice. We investigate decoding order through a masked diffusion framework, which progressively unmasks positions and allows arbitrary decoding orders during training and inference. By interpolating between identity and random permutations, we show that randomness in decoding order affects speech quality. We further compare fixed strategies, such as l2r and r2l, with adaptive ones, such as Top-K, finding that fixed-order decoding, including the dominant left-to-right approach, is suboptimal, while adaptive decoding yields better performance. Finally, since masked diffusion requires discrete inputs, we quantise acoustic representations and find that even 1-bit quantisation can support reasonably high-quality speech.
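
To make the decoding strategies named in the abstract concrete, here is a minimal Python sketch of fixed orders (l2r, r2l), an interpolation between the identity and a random permutation, and Top-K adaptive unmasking. It is not the authors' implementation. The function names, the interpolation rule (shuffling a chosen fraction of positions), the confidence scores, and the toy predictor standing in for a trained masked diffusion model are all assumptions made for illustration.

```python
import random

MASK = None  # placeholder for a position that has not been decoded yet


def l2r_order(n):
    """Fixed left-to-right order: 0, 1, ..., n-1."""
    return list(range(n))


def r2l_order(n):
    """Fixed right-to-left order: n-1, ..., 1, 0."""
    return list(range(n - 1, -1, -1))


def interpolated_order(n, randomness, rng):
    """Identity order with a fraction `randomness` of its entries shuffled.

    randomness=0.0 gives the identity (l2r) and randomness=1.0 a uniformly
    random permutation. This rule is an assumption, not the paper's scheme.
    """
    order = list(range(n))
    slots = sorted(rng.sample(range(n), round(randomness * n)))
    values = slots[:]
    rng.shuffle(values)
    for slot, value in zip(slots, values):
        order[slot] = value
    return order


def decode_fixed(n, order, predict):
    """Unmask one position per step, following a predetermined order."""
    tokens = [MASK] * n
    for pos in order:
        token, _score = predict(tokens, [pos])[pos]
        tokens[pos] = token
    return tokens


def decode_top_k(n, k, predict):
    """Adaptive decoding: each step unmasks the k highest-scoring masked positions."""
    if k < 1:
        raise ValueError("k must be at least 1")
    tokens = [MASK] * n
    order = []
    while len(order) < n:
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        preds = predict(tokens, masked)  # {position: (token, score)}
        chosen = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for pos in chosen:
            tokens[pos] = preds[pos][0]
        order.extend(chosen)
    return tokens, order


def make_toy_predict(base_scores, bonus=25):
    """Hypothetical stand-in for a trained model.

    It emits 1-bit tokens (position parity) and a score that grows with the
    number of already-decoded neighbours.
    """
    def predict(tokens, masked):
        out = {}
        for i in masked:
            known = sum(
                1 for j in (i - 1, i + 1)
                if 0 <= j < len(tokens) and tokens[j] is not MASK
            )
            out[i] = (i % 2, base_scores[i] + bonus * known)
        return out
    return predict


if __name__ == "__main__":
    predict = make_toy_predict([10, 90, 20, 80, 30, 60])
    print(decode_top_k(6, 2, predict))
    print(interpolated_order(6, 0.5, random.Random(0)))
```

With the toy scores [10, 90, 20, 80, 30, 60] and K = 2, this sketch unmasks positions in the order 1, 3, 2, 5, 4, 0: the order follows the predictor's scores at each step instead of being fixed in advance, which is the distinction the abstract draws between adaptive and fixed strategies.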

Taxonomy

Topics: Speech Recognition and Synthesis · Stochastic Gradient Optimization Techniques · Speech and Audio Processing