Expressive, Variable, and Controllable Duration Modelling in TTS
Ammar Abbas, Thomas Merritt, Alexis Moinet, Sri Karlapati, Ewa Muszynska, Simon Slangen, Elia Gatti, Thomas Drugman

TL;DR
This paper introduces two duration models for neural TTS that improve expressiveness and variability: a phrasing-conditioned model, and a multi-speaker normalising-flow model called Cauliflow with controllable pacing.
Contribution
It presents two new duration modelling approaches for TTS, improving expressiveness, variability, and control over speech pacing compared to existing methods.
Findings
The phrasing-conditioned model improves naturalness and pause modelling.
Cauliflow matches baseline naturalness while enabling variable durations.
Controllable parameters allow intuitive pacing adjustments.
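The last two findings (variable durations for the same input, and pacing control) can be illustrated with a toy example. The sketch below is not Cauliflow: it is a single affine flow in log-duration space, and the per-phone parameters `mu` and `sigma`, the `temperature` argument, and the `pace` multiplier are all hypothetical names chosen for illustration. It only shows the general mechanism a normalising flow relies on: an invertible map from base noise to durations, an exact likelihood from the change-of-variables formula, and different durations from different noise draws.

```python
import math
import random


def flow_forward(z, mu, sigma):
    """Map base noise z to a positive duration in frames: d = exp(mu + sigma * z)."""
    return math.exp(mu + sigma * z)


def flow_inverse(d, mu, sigma):
    """Map a duration back to base noise: z = (log d - mu) / sigma."""
    return (math.log(d) - mu) / sigma


def log_likelihood(d, mu, sigma):
    """Exact log-density of d under the flow (change-of-variables formula)."""
    z = flow_inverse(d, mu, sigma)
    log_base = -0.5 * (z * z + math.log(2.0 * math.pi))  # standard normal log-pdf
    log_det = -math.log(sigma) - math.log(d)             # log |dz/dd| = -log(sigma * d)
    return log_base + log_det


def sample_durations(params, temperature=1.0, pace=1.0, rng=None):
    """Sample one duration per phone.

    params:      list of (mu, sigma) pairs, one per phone (hypothetical values).
    temperature: scales the base noise; 0 gives a deterministic output.
    pace:        illustrative global multiplier; values above 1 lengthen durations.
    """
    rng = rng or random.Random()
    return [
        pace * flow_forward(temperature * rng.gauss(0.0, 1.0), mu, sigma)
        for mu, sigma in params
    ]


if __name__ == "__main__":
    # Hypothetical parameters for three phones (median durations of 8, 12, 5 frames).
    phone_params = [(math.log(8.0), 0.3), (math.log(12.0), 0.2), (math.log(5.0), 0.4)]
    print(sample_durations(phone_params, rng=random.Random(0)))
    print(sample_durations(phone_params, rng=random.Random(1)))   # same input, new draw
    print(sample_durations(phone_params, pace=1.5, rng=random.Random(0)))  # slower
```

In a real model the affine map would be replaced by a stack of learned invertible layers conditioned on text and speaker, and pacing control would come from conditioning inputs rather than a post-hoc multiplier; the multiplier here is only a stand-in to make the idea concrete.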
Abstract
Duration modelling has become an important research problem once more with the rise of non-attention neural text-to-speech systems. The current approaches largely fall back to relying on previous statistical parametric speech synthesis technology for duration prediction, which poorly models the expressiveness and variability in speech. In this paper, we propose two alternate approaches to improve duration modelling. First, we propose a duration model conditioned on phrasing that improves the predicted durations and provides better modelling of pauses. We show that the duration model conditioned on phrasing improves the naturalness of speech over our baseline duration model. Second, we also propose a multi-speaker duration model called Cauliflow, that uses normalising flows to predict durations that better match the complex target duration distribution. Cauliflow performs on par with our…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
