Beyond Two-stage Diffusion TTS: Joint Structure and Content Refinement via Jump Diffusion

Jiabao Ai, Minghui Zhao, and Anton Ragni

arXiv:2603.14032·eess.AS·March 17, 2026

TL;DR

This paper introduces a jump-diffusion framework for TTS that models temporal structure with discrete jumps and spectral content with continuous diffusion in a single process, targeting the mean-prosody collapse of two-stage models and the alignment instability of single-stage models.

Contribution

The paper proposes a unified jump-diffusion approach that combines discrete and continuous modeling for TTS, aimed at adaptive prosody and stable alignment.

Findings

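01 In its one-shot degenerate form, achieves 3.37% WER on LJSpeech, versus 4.38% for Grad-TTS.

02 Improves UTMOSv2 scores over Grad-TTS on LJSpeech.

03 The full iterative UDD variant adapts prosody, inserting natural pauses in out-of-distribution slow speech rather than stretching uniformly.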
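
To put the two reported WER figures in perspective, the short sketch below computes the absolute and relative reduction. Only the values 3.37 and 4.38 come from the abstract; the helper name `relative_reduction` is a hypothetical one introduced here for illustration.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Fraction by which `proposed` lowers an error rate relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - proposed) / baseline


# WER values in percent, as reported in the abstract.
grad_tts_wer = 4.38
proposed_wer = 3.37

absolute_gain = grad_tts_wer - proposed_wer          # percentage points
relative_gain = relative_reduction(grad_tts_wer, proposed_wer)

print(f"absolute: {absolute_gain:.2f} points, relative: {relative_gain:.1%}")
# prints "absolute: 1.01 points, relative: 23.1%"
```

The gap is about one percentage point in absolute terms, or roughly a 23% relative reduction in word errors against the Grad-TTS baseline.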

Abstract

Diffusion and flow-matching TTS models face a tension between discrete temporal structure and continuous spectral modeling. Two-stage models diffuse on fixed alignments, often collapsing to mean prosody; single-stage models avoid explicit durations but suffer alignment instability. We propose a jump-diffusion framework where discrete jumps model temporal structure and continuous diffusion refines spectral content within one process. Even in its one-shot degenerate form, our framework achieves 3.37% WER vs. 4.38% for Grad-TTS with improved UTMOSv2 on LJSpeech. The full iterative UDD variant further enables adaptive prosody, autonomously inserting natural pauses in out-of-distribution slow speech rather than stretching uniformly. Audio samples are available at https://anonymousinterpseech.github.io/TTS_Demo/.
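
As a rough intuition for how discrete jumps and continuous diffusion can coexist in one process, the toy sketch below evolves a variable-length sequence of scalar "frames". It is not the paper's model: the function name `toy_jump_diffusion`, its parameters, and the insertion rule are all assumptions made for illustration, and the continuous step here merely perturbs values rather than refining them with a learned network.

```python
import random


def toy_jump_diffusion(frames, steps=50, jump_rate=0.1, noise_scale=0.05, seed=0):
    """Evolve a list of floats with discrete jumps and continuous noise.

    Illustration only (hypothetical, not the paper's algorithm):
    - with probability `jump_rate` per step, one frame is inserted,
      changing the sequence length (a discrete, structural change);
    - every frame then receives a small Gaussian perturbation
      (a continuous change to content).
    """
    if not frames:
        raise ValueError("frames must be non-empty")
    rng = random.Random(seed)
    frames = list(frames)
    n_jumps = 0
    for _ in range(steps):
        # Discrete jump: duplicate a neighboring frame at a random position.
        if rng.random() < jump_rate:
            pos = rng.randrange(len(frames) + 1)
            neighbor = frames[min(pos, len(frames) - 1)]
            frames.insert(pos, neighbor)
            n_jumps += 1
        # Continuous diffusion: add Gaussian noise to every frame value.
        frames = [x + rng.gauss(0.0, noise_scale) for x in frames]
    return frames, n_jumps


start = [0.0, 1.0, 0.5]
out, n_jumps = toy_jump_diffusion(start)
print(len(start), "->", len(out), "frames after", n_jumps, "jumps")
```

The point of the sketch is only that sequence length (structure) and frame values (content) change inside the same loop, in contrast to a two-stage pipeline that fixes the length first and then diffuses on it.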

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders