Autoregressive Diffusion Transformer for Text-to-Speech Synthesis

Zhijun Liu; Shuai Wang; Sho Inoue; Qibing Bai; Haizhou Li

arXiv:2406.05551 · eess.AS · June 11, 2024

Open Access

TL;DR

This paper introduces ARDiT, a decoder-only diffusion transformer that autoregressively generates audio encoded as continuous vector sequences for high-quality, low-latency text-to-speech synthesis, matching or surpassing state-of-the-art models in zero-shot scenarios.

Contribution

ARDiT combines continuous audio encoding with diffusion-based autoregressive generation, enabling near-flawless speech reconstruction and efficient, lower-latency sampling.

Findings

01. ARDiT matches or surpasses state-of-the-art models in zero-shot TTS.

02. Distillation with Integral Kullback-Leibler (IKL) divergence improves perceived sample quality.

03. One model generates 170 ms of 24 kHz speech per evaluation step with minimal performance degradation.
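To put the 170 ms figure in concrete terms, the arithmetic below converts it into waveform samples per step and into the number of steps needed for a longer utterance. This is a back-of-the-envelope sketch; the function names and the 10-second example are illustrative assumptions and do not come from the paper.

```python
import math


def samples_per_step(duration_ms: float, sample_rate_hz: int) -> int:
    """Waveform samples covered by one generation step."""
    return round(duration_ms * sample_rate_hz / 1000)


def steps_for_duration(total_seconds: float, duration_ms: float) -> int:
    """Generation steps needed to cover total_seconds, rounded up."""
    return math.ceil(total_seconds * 1000 / duration_ms)


# 170 ms at 24 kHz: 0.170 * 24000 = 4080 samples per step.
SAMPLES = samples_per_step(170, 24_000)

# A hypothetical 10-second utterance: 10000 / 170 = 58.8..., so 59 steps.
STEPS_10S = steps_for_duration(10.0, 170)
```

Under these assumptions, each step covers 4080 waveform samples, and a 10-second utterance needs 59 steps.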

Abstract

Audio language models have recently emerged as a promising approach for various audio generation tasks, relying on audio tokenizers to encode waveforms into sequences of discrete symbols. Audio tokenization often poses a necessary compromise between code bitrate and reconstruction accuracy. When dealing with low-bitrate audio codes, language models are constrained to process only a subset of the information embedded in the audio, which in turn restricts their generative capabilities. To circumvent these issues, we propose encoding audio as vector sequences in continuous space $\mathbb{R}^{d}$ and autoregressively generating these sequences using a decoder-only diffusion transformer (ARDiT). Our findings indicate that ARDiT excels in zero-shot text-to-speech and exhibits performance that compares to or even surpasses that of state-of-the-art models. High-bitrate continuous speech…
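The general idea of autoregressively generating continuous vectors with a diffusion model can be illustrated with a toy loop: each new block of vectors starts as Gaussian noise and is iteratively denoised while conditioned on everything generated so far. The sketch below is not the authors' sampler. `toy_denoiser` is a hypothetical stand-in for the transformer, text conditioning is omitted, and the linear interpolation schedule is an assumption chosen for simplicity.

```python
import numpy as np


def toy_denoiser(noisy_block, context, t):
    """Hypothetical stand-in for the network: predicts a clean block.

    It simply predicts the mean of the context vectors (or zeros when
    there is no context). A real model would be a trained transformer.
    """
    dim = noisy_block.shape[1]
    if len(context) > 0:
        target = context.mean(axis=0, keepdims=True)
    else:
        target = np.zeros((1, dim))
    return np.broadcast_to(target, noisy_block.shape)


def sample_block(context, block_len, dim, n_steps, rng):
    """Denoise one block of continuous vectors, conditioned on context."""
    x = rng.standard_normal((block_len, dim))  # start from pure noise
    for i in range(n_steps):
        t = 1.0 - i / n_steps             # current noise level, 1 -> 1/n_steps
        t_next = 1.0 - (i + 1) / n_steps  # next noise level, reaches 0 at the end
        x0_hat = toy_denoiser(x, context, t)
        # Assumed schedule x_t = (1 - t) * x0 + t * noise, stepped deterministically.
        x = x0_hat + (t_next / t) * (x - x0_hat)
    return x


def generate(prompt, n_blocks, block_len, n_steps=4, seed=0):
    """Append n_blocks denoised blocks to the prompt, one block at a time."""
    rng = np.random.default_rng(seed)
    seq = prompt.copy()
    for _ in range(n_blocks):
        block = sample_block(seq, block_len, seq.shape[1], n_steps, rng)
        seq = np.concatenate([seq, block], axis=0)
    return seq
```

For example, `generate(np.ones((3, 8)), n_blocks=2, block_len=4)` returns an array of shape `(11, 8)` whose first three rows are the prompt. Generating several vectors per block is what reduces the number of autoregressive steps, at the cost of asking the denoiser to predict more at once.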

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Diffusion