U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech
Xin Jing, Yi Chang, Zijiang Yang, Jiangjian Xie, Andreas Triantafyllopoulos, Bjoern W. Schuller

TL;DR
This paper introduces U-DiT TTS, a novel text-to-speech system using a vision transformer-based diffusion model with a U-Net architecture, achieving state-of-the-art results on the LJSpeech dataset.
Contribution
Proposes the U-DiT architecture, which combines U-Net and Vision Transformer (ViT) designs as the backbone for diffusion-based TTS, and reports improved performance and scalability.
Findings
Achieves state-of-the-art mean opinion scores (MOS) on LJSpeech.
Outperforms existing diffusion-based TTS models in quality.
Demonstrates the effectiveness of vision transformers as the backbone of diffusion-based TTS systems.
Abstract
Deep learning has led to considerable advances in text-to-speech synthesis. Most recently, the adoption of Score-based Generative Models (SGMs), also known as Diffusion Probabilistic Models (DPMs), has gained traction due to their ability to produce high-quality synthesized neural speech in neural speech synthesis systems. In SGMs, the U-Net architecture and its variants have long dominated as the backbone since its first successful adoption. In this research, we mainly focus on the neural network in diffusion-model-based Text-to-Speech (TTS) systems and propose the U-DiT architecture, exploring the potential of vision transformer architecture as the core component of the diffusion models in a TTS system. The modular design of the U-DiT architecture, inherited from the best parts of U-Net and ViT, allows for great scalability and versatility across different data scales. The proposed…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
