DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech
Xin Qi, Ruibo Fu, Zhengqi Wen, Tao Wang, Chunyu Qiang, Jianhua Tao, Chenxing Li, Yi Lu, Shuchen Shi, Zhiyong Wang, Xiaopeng Wang, Yuankun Xie, Yukun Liu, Xuefei Liu, Guanjun Li

TL;DR
DPI-TTS introduces a novel speech diffusion model that accelerates training and enhances naturalness by incorporating directional patch interactions and style-aware temporal modeling tailored to speech's acoustic properties.
Contribution
It presents DPI-TTS, a new method that improves training speed and speech naturalness by integrating directional patch interactions and style temporal modeling in diffusion-based TTS.
Findings
Training speed nearly doubled
Significant improvement in speech naturalness
Enhanced speaker style similarity
Abstract
In recent years, speech diffusion models have advanced rapidly. Alongside the widely used U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have also gained attention. However, current DiT speech models treat Mel spectrograms as general images, which overlooks the specific acoustic properties of speech. To address these limitations, we propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS), which builds on DiT and achieves fast training without compromising accuracy. Notably, DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties, enhancing the naturalness of the generated speech. Additionally, we introduce a fine-grained style temporal modeling method that further improves speaker style similarity. Experimental results demonstrate that our…
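The abstract does not spell out how the directional interaction is implemented, so the following is only a minimal sketch of one possible reading: Mel-spectrogram patches are ordered frame by frame, low to high frequency within each frame, and each patch attends only to patches at or before its own position in that ordering. The function names, the patch grid sizes, and the use of a hard attention mask are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def directional_patch_mask(n_freq: int, n_time: int) -> np.ndarray:
    """Hypothetical mask: M[q, k] is True if query patch q may attend to key patch k.

    Patches are flattened frame by frame, low-to-high frequency inside each frame
    (an assumed ordering; the abstract does not specify one).
    """
    t_idx, f_idx = np.meshgrid(np.arange(n_time), np.arange(n_freq), indexing="ij")
    t = t_idx.ravel()  # frame index of each flattened patch
    f = f_idx.ravel()  # frequency-band index of each flattened patch
    earlier_frame = t[None, :] < t[:, None]
    same_frame_lower_or_equal = (t[None, :] == t[:, None]) & (f[None, :] <= f[:, None])
    return earlier_frame | same_frame_lower_or_equal

def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # block disallowed directions
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Usage: 4 frequency bands x 5 frames of patches, 8-dimensional patch embeddings.
rng = np.random.default_rng(0)
mask = directional_patch_mask(n_freq=4, n_time=5)
x = rng.normal(size=(4 * 5, 8))
out = masked_attention(x, x, x, mask)
```

Under this assumed ordering the mask reduces to a lower-triangular matrix over the flattened patches; building it from explicit frame and frequency coordinates makes it easy to swap in a different directional rule.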
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Convolution · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization
