DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech

Xin Qi; Ruibo Fu; Zhengqi Wen; Tao Wang; Chunyu Qiang; Jianhua Tao; Chenxing Li; Yi Lu; Shuchen Shi; Zhiyong Wang; Xiaopeng Wang; Yuankun Xie; Yukun Liu; Xuefei Liu; Guanjun Li

arXiv:2409.11835·cs.SD·September 19, 2024

Open Access

TL;DR

DPI-TTS introduces a novel speech diffusion model that accelerates training and enhances naturalness by incorporating directional patch interactions and style-aware temporal modeling tailored to speech's acoustic properties.

Contribution

The paper presents DPI-TTS, a diffusion-based TTS method that improves training speed and speech naturalness by integrating directional patch interactions and style temporal modeling.

Findings

01. Training speed increased by nearly 2 times

02. Significant improvement in speech naturalness

03. Enhanced speaker style similarity

Abstract

In recent years, speech diffusion models have advanced rapidly. Alongside the widely used U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have also gained attention. However, current DiT speech models treat Mel spectrograms as general images, which overlooks the specific acoustic properties of speech. To address these limitations, we propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS), which builds on DiT and achieves fast training without compromising accuracy. Notably, DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties, enhancing the naturalness of the generated speech. Additionally, we introduce a fine-grained style temporal modeling method that further improves speaker style similarity. Experimental results demonstrate that our…
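The abstract does not spell out how the low-to-high frequency, frame-by-frame processing is implemented. As a rough illustration only, the sketch below shows one way such a directional ordering over Mel spectrogram patches could be expressed: patches are visited frame by frame, from low to high frequency within each frame, and each patch may attend to itself and to patches earlier in that order. The function names, the patch size, and the causal-mask interpretation are all assumptions made for this example, not details taken from the paper.

```python
from typing import List, Tuple


def patch_grid(n_mels: int, n_frames: int, patch: int) -> Tuple[int, int]:
    """Return (frequency patches, time patches) for a Mel spectrogram."""
    if n_mels % patch or n_frames % patch:
        raise ValueError("n_mels and n_frames must be divisible by patch")
    return n_mels // patch, n_frames // patch


def directional_order(n_freq: int, n_time: int) -> List[Tuple[int, int]]:
    """Order patches frame by frame, low to high frequency within a frame.

    Each entry is (time_index, frequency_index).
    """
    return [(t, f) for t in range(n_time) for f in range(n_freq)]


def directional_mask(n_freq: int, n_time: int) -> List[List[bool]]:
    """mask[i][j] is True when patch i may attend to patch j.

    Assumption (hypothetical, not from the paper): a patch sees itself
    and every patch that precedes it in the directional order.
    """
    n = n_freq * n_time
    return [[j <= i for j in range(n)] for i in range(n)]


# Example: an 80-bin, 40-frame spectrogram split into 20x20 patches.
n_freq, n_time = patch_grid(n_mels=80, n_frames=40, patch=20)
order = directional_order(n_freq, n_time)
mask = directional_mask(n_freq, n_time)
```

In this toy setup the grid has 4 frequency patches by 2 time patches, so the sequence has 8 patches. The first patch attends only to itself and the last patch attends to all 8. A mask of this shape could be passed to a standard attention layer, though whether DPI-TTS uses masking or some other mechanism is not stated in the abstract.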

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Convolution · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization