Autoregressive Diffusion Transformer for Text-to-Speech Synthesis
Zhijun Liu, Shuai Wang, Sho Inoue, Qibing Bai, Haizhou Li

TL;DR
This paper introduces ARDiT, a decoder-only autoregressive diffusion transformer that encodes audio as continuous vectors for high-quality, low-latency text-to-speech synthesis, matching or surpassing state-of-the-art models in zero-shot scenarios.
Contribution
The novel ARDiT model uses continuous audio encoding and diffusion-based autoregressive generation, enabling near-flawless speech reconstruction and efficient sampling with reduced latency.
Findings
ARDiT matches or surpasses state-of-the-art models in zero-shot TTS.
Distillation with the Integral Kullback-Leibler (IKL) divergence improves sample quality.
One model generates 170 ms of 24 kHz speech per evaluation step with minimal degradation.
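To put the last finding in concrete terms, the short sketch below converts the quoted figures (170 ms per step at 24 kHz) into waveform samples per step and into the number of autoregressive steps needed for an utterance. The function names and the 10-second utterance length are illustrative assumptions, not taken from the paper, and the calculation ignores any vocoder or encoder overhead.

```python
# Back-of-the-envelope arithmetic for the "170 ms of 24 kHz speech per step" figure.
# Function names and the 10 s utterance are hypothetical, chosen for illustration.

SAMPLE_RATE_HZ = 24_000   # from the summary: 24 kHz speech
STEP_DURATION_MS = 170    # from the summary: 170 ms generated per step


def samples_per_step(sample_rate_hz: int, step_ms: int) -> int:
    """Number of waveform samples covered by one generation step."""
    return sample_rate_hz * step_ms // 1000


def steps_for_duration(duration_ms: int, step_ms: int) -> int:
    """Steps needed to cover duration_ms, rounding up (integer ceiling)."""
    return (duration_ms + step_ms - 1) // step_ms


if __name__ == "__main__":
    print(samples_per_step(SAMPLE_RATE_HZ, STEP_DURATION_MS))  # 4080
    print(steps_for_duration(10_000, STEP_DURATION_MS))        # 59
```

Under these assumptions, each step covers 4,080 waveform samples, so a 10-second utterance needs roughly 59 steps.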
Abstract
Audio language models have recently emerged as a promising approach for various audio generation tasks, relying on audio tokenizers to encode waveforms into sequences of discrete symbols. Audio tokenization often poses a necessary compromise between code bitrate and reconstruction accuracy. When dealing with low-bitrate audio codes, language models are constrained to process only a subset of the information embedded in the audio, which in turn restricts their generative capabilities. To circumvent these issues, we propose encoding audio as vector sequences in continuous space and autoregressively generating these sequences using a decoder-only diffusion transformer (ARDiT). Our findings indicate that ARDiT excels in zero-shot text-to-speech and exhibits performance that compares to or even surpasses that of state-of-the-art models. High-bitrate continuous speech…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Diffusion
