DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors

Keon Lee; Dong Won Kim; Jaehyeon Kim; Seungjun Chung; Jaewoong Cho

arXiv:2406.11427 · eess.AS · February 18, 2025

TL;DR

DiTTo-TTS introduces a diffusion transformer-based text-to-speech model that achieves state-of-the-art performance without relying on domain-specific phoneme and duration factors, scaling effectively to large datasets.

Contribution

The paper demonstrates that a Diffusion Transformer (DiT) can outperform U-Net backbones in latent diffusion TTS without relying on phonemes or durations, aided by variable-length modeling with a speech length predictor and by semantic alignment in the speech latent representations.

Findings

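01. DiT with minimal modifications outperforms U-Net.

02. Variable-length modeling with a speech length predictor improves results over fixed-length approaches.

03. Semantic alignment in the speech latent representations enhances speech quality.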

Abstract

Large-scale latent diffusion models (LDMs) excel in content generation across various modalities, but their reliance on phonemes and durations in text-to-speech (TTS) limits scalability and access from other fields. While recent studies show potential in removing these domain-specific factors, performance remains suboptimal. In this work, we introduce DiTTo-TTS, a Diffusion Transformer (DiT)-based TTS model, to investigate whether LDM-based TTS can achieve state-of-the-art performance without domain-specific factors. Through rigorous analysis and empirical exploration, we find that (1) DiT with minimal modifications outperforms U-Net, (2) variable-length modeling with a speech length predictor significantly improves results over fixed-length approaches, and (3) conditions like semantic alignment in speech latent representations are key to further enhancement. By scaling our training…
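
The variable-length idea in finding (2) can be illustrated with a small sketch. This is not the authors' implementation: the function names `length_mask` and `masked_mse` are hypothetical, the "predicted" lengths are hard-coded stand-ins for a speech length predictor's output, and plain Python lists stand in for latent tensors. The sketch only shows how a per-utterance length turns into a padding mask so that a training loss counts valid latent frames and ignores padding, which a fixed-length setup would otherwise include.

```python
from typing import List


def length_mask(lengths: List[int], max_len: int) -> List[List[int]]:
    """Build a 0/1 mask per utterance: 1 marks a valid latent frame, 0 marks padding."""
    if any(n < 0 or n > max_len for n in lengths):
        raise ValueError("each length must lie in [0, max_len]")
    return [[1] * n + [0] * (max_len - n) for n in lengths]


def masked_mse(pred: List[List[float]],
               target: List[List[float]],
               mask: List[List[int]]) -> float:
    """Mean squared error averaged only over frames where the mask is 1."""
    total = 0.0
    count = 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


# Assumed lengths (in latent frames) for a batch of two utterances.
predicted_lengths = [2, 3]
mask = length_mask(predicted_lengths, max_len=4)

# Toy one-dimensional "latents"; padded positions hold arbitrary values.
pred = [[1.0, 2.0, 9.0, 9.0], [0.0, 0.0, 0.0, 9.0]]
target = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = masked_mse(pred, target, mask)
```

With the mask applied, the large errors at padded positions do not contribute to `loss`; without it, they would dominate the average.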

Code & Models

Repositories

keonlee9420/evaluate-zero-shot-tts
pytorch

Videos

DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors · slideslive

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems

Methods: Attention Is All You Need · Concatenated Skip Connection · Max Pooling · Convolution · U-Net · Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding