Median-Based Generation of Synthetic Speech Durations using a Non-Parametric Approach
Srikanth Ronanki, Oliver Watts, Simon King, Gustav Eje Henter

TL;DR
This paper introduces a non-parametric, median-based approach to generate synthetic speech durations using a recurrent model that predicts phone transition probabilities, offering robustness and incremental generation capabilities.
Contribution
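The paper presents a non-parametric duration model: a recurrent model predicts a phone transition probability at each acoustic frame, and durations are generated from the median of the implied distribution. This avoids assuming a fixed distributional form such as a Gaussian, as conventional parametric approaches do.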
Findings
The median-based approach is competitive with baseline methods in approximating natural speech durations.
The method enables incremental duration generation and is robust to data irregularities.
It supports modeling durations alongside acoustic features in a unified framework.
Abstract
This paper proposes a new approach to duration modelling for statistical parametric speech synthesis in which a recurrent statistical model is trained to output a phone transition probability at each timestep (acoustic frame). Unlike conventional approaches to duration modelling -- which assume that duration distributions have a particular form (e.g., a Gaussian) and use the mean of that distribution for synthesis -- our approach can in principle model any distribution supported on the non-negative integers. Generation from this model can be performed in many ways; here we consider output generation based on the median predicted duration. The median is more typical (more probable) than the conventional mean duration, is robust to training-data irregularities, and enables incremental generation. Furthermore, a frame-level approach to duration prediction is consistent with a longer-term…
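To make the median-based generation concrete, here is a minimal sketch, not the authors' implementation. It assumes the model emits, at each frame, the probability that the current phone ends at that frame given that it has not ended earlier, and that every phone lasts at least one frame. The function name `median_duration` and its parameters are hypothetical. The median is taken as the first frame at which the cumulative probability of having transitioned reaches 0.5, so the loop can stop as soon as that point is reached, which is what makes incremental generation possible.

```python
from typing import Iterable, Optional


def median_duration(transition_probs: Iterable[float],
                    threshold: float = 0.5) -> Optional[int]:
    """Return the smallest duration d (in frames) with P(duration <= d) >= threshold.

    transition_probs: per-frame probabilities that the phone ends at that
    frame, conditioned on it not having ended before (an assumed convention).
    Returns None if the threshold is never reached within the given frames.
    """
    survival = 1.0  # probability that the phone is still ongoing
    for frame, p in enumerate(transition_probs, start=1):
        survival *= 1.0 - p
        if 1.0 - survival >= threshold:
            # The cumulative probability has crossed the threshold, so this
            # frame is the median; no later frames need to be inspected.
            return frame
    return None


# Illustrative values only: a constant per-frame transition probability of 0.2
# gives a geometric duration distribution with mean 1 / 0.2 = 5 frames.
print(median_duration([0.2] * 10))
```

With these illustrative inputs the median falls below the mean of 5 frames, because the geometric distribution is right-skewed. Since `transition_probs` can be any iterable, the probabilities can be consumed lazily from a generator as the model produces them.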
