Neural Network-Based Modeling of Phonetic Durations
Xizi Wei, Melvyn Hunt, Adrian Skilling

TL;DR
This paper introduces a deep neural network (DNN) model that predicts distributions of phoneme durations in specified phonetic contexts, uses it to identify the factors that most influence durations, and applies it to checking training speech for speech synthesis and speech recognition.
Contribution
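The paper presents a DNN-based model that predicts non-parametric distributions of phoneme durations in specified phonetic contexts, uses it to explore which factors most influence durations, and applies it to training speech for text-to-speech (TTS) and automatic speech recognition (ASR).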
Findings
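Major factors influencing durations in US English are pre-pausal lengthening, lexical stress, and speaking rate.
The model can be used to check that TTS training speech follows the script and that words are pronounced as expected.
Duration prediction is poorer on ASR training speech, which typically consists of single utterances from many speakers and is often noisy or casually spoken.
Low-probability durations in ASR training material mostly correspond to non-standard speech, some of it disfluent, and children's speech is disproportionately present in these utterances.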
Abstract
A deep neural network (DNN)-based model has been developed to predict non-parametric distributions of durations of phonemes in specified phonetic contexts and used to explore which factors influence durations most. Major factors in US English are pre-pausal lengthening, lexical stress, and speaking rate. The model can be used to check that text-to-speech (TTS) training speech follows the script and words are pronounced as expected. Duration prediction is poorer with training speech for automatic speech recognition (ASR) because the training corpus typically consists of single utterances from many speakers and is often noisy or casually spoken. Low probability durations in ASR training material nevertheless mostly correspond to non-standard speech, with some having disfluencies. Children's speech is disproportionately present in these utterances, since children show much more variation…
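The idea of using low-probability durations to flag suspect training speech can be sketched as follows. This is only an illustration, not the authors' implementation: the summary does not describe the network's output layer, so the code assumes each phoneme comes with a predicted probability for each duration bin (one bin per frame, with the last bin absorbing longer durations). The function names, the example distributions, and the 0.01 threshold are all hypothetical.

```python
import math


def duration_log_prob(bin_probs, duration_frames):
    """Log-probability of an observed duration under a binned distribution.

    Assumption: bin_probs[i] is the predicted probability that the phoneme
    lasts i + 1 frames, and the last bin absorbs all longer durations.
    """
    if duration_frames < 1:
        raise ValueError("duration must be at least one frame")
    index = min(duration_frames, len(bin_probs)) - 1
    # Floor the probability so an empty bin scores very low instead of -inf.
    return math.log(max(bin_probs[index], 1e-12))


def flag_unlikely(phonemes, threshold=math.log(0.01)):
    """Return labels of phonemes whose observed duration scores below threshold.

    Each item is (label, bin_probs, observed_duration_in_frames).
    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    return [
        label
        for label, bin_probs, duration in phonemes
        if duration_log_prob(bin_probs, duration) < threshold
    ]


# Made-up distributions standing in for the network's predictions.
utterance = [
    ("AE", [0.05, 0.15, 0.40, 0.30, 0.10], 3),    # typical duration
    ("T", [0.50, 0.30, 0.15, 0.045, 0.005], 9),   # far longer than predicted
]
flagged = flag_unlikely(utterance)
print(flagged)
```

In a real pipeline the flagged phonemes would point a reviewer to utterances that may deviate from the script or contain disfluencies; how scores are aggregated per utterance is a design choice the summary does not specify.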
Taxonomy
Topics: Phonetics and Phonology Research · Speech Recognition and Synthesis · Speech and Audio Processing
