DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability

Hyun Joon Park; Jin Sob Kim; Wooseok Shin; Sung Won Han

arXiv:2406.19135·eess.AS·June 28, 2024

TL;DR

DEX-TTS is a diffusion-based expressive TTS model that separates the styles extracted from reference speech into time-invariant and time-variant components, improving the naturalness and style representation of synthesized speech.

Contribution

The paper proposes a novel diffusion-based TTS framework with specialized style encoders and adapters, enhancing style modeling and generalization without pre-training.

Findings

01 Achieves superior objective and subjective performance on multi-speaker datasets.

02 Effectively models style variations, including emotional and speaker-specific styles.

03 Demonstrates robustness on single-speaker TTS tasks.

Abstract

Expressive Text-to-Speech (TTS) using reference speech has been studied extensively to synthesize natural speech, but limitations remain in obtaining well-represented styles and in improving model generalization ability. In this study, we present Diffusion-based EXpressive TTS (DEX-TTS), an acoustic model designed for reference-based speech synthesis with enhanced style representations. Based on a general diffusion TTS framework, DEX-TTS includes encoders and adapters to handle styles extracted from reference speech. Key innovations include the differentiation of styles into time-invariant and time-variant categories for effective style extraction, as well as the design of encoders and adapters with high generalization ability. In addition, we introduce overlapping patchify and convolution-frequency patch embedding strategies to improve DiT-based diffusion networks for TTS. DEX-TTS…
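
The abstract names "overlapping patchify" without describing it. As a rough illustration of the general idea only, the sketch below splits a small spectrogram-like grid into patches along the time axis, with a stride smaller than the patch width so that neighbouring patches share frames. This is not the authors' implementation: the function name `overlapping_patchify`, the `patch_size` and `stride` parameters, the time-axis-only patching, and the toy input are all assumptions made for illustration.

```python
def overlapping_patchify(spec, patch_size, stride):
    """Split a spectrogram (list of rows, one row per frequency bin) into
    patches along the time axis. Patches overlap when stride < patch_size.

    Hypothetical helper for illustration; not taken from the DEX-TTS code.
    """
    if not 0 < stride <= patch_size:
        raise ValueError("stride must satisfy 0 < stride <= patch_size")
    n_frames = len(spec[0])
    if n_frames < patch_size:
        raise ValueError("spectrogram has fewer frames than patch_size")

    patches = []
    # Slide a window of width patch_size over the time axis in steps of stride.
    for start in range(0, n_frames - patch_size + 1, stride):
        patches.append([row[start:start + patch_size] for row in spec])
    return patches


# Toy input: 2 frequency bins x 8 frames, where each value encodes (bin, frame).
spec = [[m * 10 + t for t in range(8)] for m in range(2)]
patches = overlapping_patchify(spec, patch_size=4, stride=2)

# With stride 2 and width 4, consecutive patches share two frames.
shared = patches[0][0][2:] == patches[1][0][:2]
```

Setting `stride` equal to `patch_size` gives ordinary non-overlapping patches; a smaller stride yields more patches, and each frame away from the edges appears in more than one of them.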

Code & Models

Repositories

winddori2002/dex-tts (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: Diffusion