Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech
Shivam Mehta, Harm Lameris, Rajiv Punmiya, Jonas Beskow, Éva Székely, Gustav Eje Henter

TL;DR
This paper shows that probabilistic duration models improve non-autoregressive (NAR) TTS quality across three TTS approaches and four corpora, especially for spontaneous speech.
Contribution
It compares deterministic and probabilistic duration modelling in NAR TTS and shows the benefits of the latter, particularly for spontaneous speech, which prior studies largely ignored.
Findings
Probabilistic duration models outperform deterministic ones in NAR TTS.
Stochastic duration modeling enhances TTS quality for spontaneous speech.
Benefits are consistent across multiple corpora and approaches.
Abstract
Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, prior studies generally only consider speech read aloud, and ignore spontaneous speech, despite the latter being both a more common and a more variable mode of speaking. We compare the effect of conventional deterministic duration modelling to durations sampled from a powerful probabilistic model based on conditional flow matching (OT-CFM), in three different NAR TTS approaches: regression-based, deep generative, and end-to-end. Across four different corpora, stochastic duration modelling…
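The contrast the abstract draws between regression-based and sampled durations can be illustrated with a toy sketch. It is not the authors' implementation. It assumes log-durations per phone follow a Gaussian with hypothetical parameters `mu_log` and `sigma_log`, and it replaces the learned, text-conditioned OT-CFM network with the analytic vector field for that Gaussian (with sigma_min set to 0), so that the Euler sampling loop can run without any training. All function names are illustrative.

```python
import numpy as np


def deterministic_durations(mu_log):
    """Regression-style predictor: always returns the same frame counts."""
    return np.maximum(1, np.round(np.exp(mu_log))).astype(int)


def toy_vector_field(x, t, mu_log, sigma_log):
    """Analytic stand-in for a learned flow-matching network (assumption).

    For the straight-line path x_t = (1 - t) * x0 + t * x1 with
    x0 ~ N(0, 1) and x1 ~ N(mu_log, sigma_log^2), this is the exact
    marginal velocity E[x1 - x0 | x_t = x].
    """
    var_t = (1.0 - t) ** 2 + (t * sigma_log) ** 2
    slope = (t * sigma_log ** 2 - (1.0 - t)) / var_t
    return mu_log + slope * (x - t * mu_log)


def sample_log_durations(mu_log, sigma_log, rng, n_steps=50):
    """Draw noise, then integrate dx/dt = v(x, t) from t=0 to t=1 with Euler."""
    x = rng.standard_normal(mu_log.shape)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * toy_vector_field(x, t, mu_log, sigma_log)
    return x


def sample_durations(mu_log, sigma_log, rng, n_steps=50):
    """Convert sampled log-durations to integer frame counts (at least 1)."""
    x = sample_log_durations(mu_log, sigma_log, rng, n_steps)
    return np.maximum(1, np.round(np.exp(x))).astype(int)


# Hypothetical three-phone utterance: mean durations of 5, 12 and 8 frames.
mu_log = np.log(np.array([5.0, 12.0, 8.0]))
sigma_log = np.full(3, 0.3)

print("deterministic:", deterministic_durations(mu_log))
for seed in range(3):
    rng = np.random.default_rng(seed)
    print(f"sampled (seed={seed}):", sample_durations(mu_log, sigma_log, rng))
```

The deterministic predictor yields identical timings on every call, whereas each new noise draw passed through the flow yields a different plausible set of durations. In a real system the vector field would be a neural network conditioned on the input text.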
Taxonomy
Topics: Phonetics and Phonology Research · Speech Recognition and Synthesis · Speech and Audio Processing
