Beyond Two-stage Diffusion TTS: Joint Structure and Content Refinement via Jump Diffusion
Jiabao Ai, Minghui Zhao, and Anton Ragni

TL;DR
This paper introduces a jump-diffusion framework for TTS that jointly models temporal structure and spectral content, improving prosody and alignment stability over traditional two-stage and single-stage methods.
Contribution
It proposes a novel unified jump-diffusion approach that combines discrete and continuous modeling for TTS, enabling better prosody control and alignment stability.
Findings
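On LJSpeech, the one-shot degenerate form achieves 3.37% WER, compared with 4.38% for Grad-TTS.
It also improves UTMOSv2 scores on LJSpeech.
The full iterative UDD variant enables adaptive prosody, inserting natural pauses in out-of-distribution slow speech rather than stretching it uniformly.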
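The WER figures above can be turned into absolute and relative reductions with simple arithmetic. The short sketch below only restates the two reported numbers; the function and variable names are illustrative and do not come from the paper.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Fraction by which `proposed` lowers an error rate relative to `baseline`."""
    return (baseline - proposed) / baseline


# WER values (in percent) as reported in the findings.
grad_tts_wer = 4.38
proposed_wer = 3.37

absolute = grad_tts_wer - proposed_wer                    # percentage points
relative = relative_reduction(grad_tts_wer, proposed_wer)  # fraction of baseline

print(f"absolute: {absolute:.2f} points, relative: {relative:.1%}")
# prints "absolute: 1.01 points, relative: 23.1%"
```

In other words, the reported gap is about 1.01 percentage points, or roughly a 23% relative reduction in WER with respect to Grad-TTS.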
Abstract
Diffusion and flow-matching TTS models face a tension between discrete temporal structure and continuous spectral modeling. Two-stage models diffuse on fixed alignments, often collapsing to mean prosody; single-stage models avoid explicit durations but suffer from alignment instability. We propose a jump-diffusion framework in which discrete jumps model temporal structure and continuous diffusion refines spectral content within one process. Even in its one-shot degenerate form, our framework achieves 3.37% WER vs. 4.38% for Grad-TTS, with improved UTMOSv2 on LJSpeech. The full iterative UDD variant further enables adaptive prosody, autonomously inserting natural pauses in out-of-distribution slow speech rather than stretching uniformly. Audio samples are available at https://anonymousinterpseech.github.io/TTS_Demo/.
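For readers unfamiliar with the term, a jump-diffusion process combines small continuous Gaussian increments with occasional discrete jumps in a single stochastic process. The sketch below simulates a generic one-dimensional textbook example with an Euler scheme. It is not the paper's TTS model: the function name, parameters, and Gaussian jump sizes are all assumptions made for illustration, and the abstract does not specify how jumps act on temporal structure.

```python
import math
import random


def simulate_jump_diffusion(n_steps=1000, dt=1e-3, sigma=1.0,
                            jump_rate=5.0, jump_scale=1.0, seed=0):
    """Simulate a toy 1-D jump-diffusion path (illustrative only).

    Each step adds a Gaussian increment (the continuous diffusion part) and,
    with probability approximately jump_rate * dt, a discrete jump.
    Returns the path as a list and the number of jumps that occurred.
    """
    rng = random.Random(seed)
    x = 0.0
    path = [x]
    n_jumps = 0
    for _ in range(n_steps):
        # Continuous part: Brownian increment with standard deviation sigma * sqrt(dt).
        x += sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # Discrete part: a jump arrives with probability ~ jump_rate * dt per step.
        if rng.random() < jump_rate * dt:
            x += rng.gauss(0.0, jump_scale)  # assumed Gaussian jump size
            n_jumps += 1
        path.append(x)
    return path, n_jumps


path, n_jumps = simulate_jump_diffusion()
print(len(path), n_jumps)
```

Setting `jump_rate=0.0` recovers a pure diffusion path, which makes the separate roles of the two components easy to see in this toy setting.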
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders
