Vocal effort modeling in neural TTS for improving the intelligibility of synthetic speech in noise
Tuomo Raitio, Petko Petkov, Jiangchuan Li, Muhammed Shifas, Andrea Davis, Yannis Stylianou

TL;DR
This paper introduces a neural TTS approach that models vocal effort variation to improve the intelligibility of synthetic speech in noise. It conditions the model on spectral tilt and extrapolates that parameter beyond the values seen in the original data.
Contribution
It presents a novel spectral tilt conditioning method for neural TTS that enables independent vocal effort control and improves speech intelligibility in noise.
Findings
Enhanced intelligibility in noisy conditions
Maintained speech quality with effort control
Outperformed existing speech enhancement algorithms
Abstract
We present a neural text-to-speech (TTS) method that models natural vocal effort variation to improve the intelligibility of synthetic speech in the presence of noise. The method consists of first measuring the spectral tilt of unlabeled conventional speech data, and then conditioning a neural TTS model with normalized spectral tilt among other prosodic factors. Changing the spectral tilt parameter and keeping other prosodic factors unchanged enables effective vocal effort control at synthesis time independent of other prosodic factors. By extrapolation of the spectral tilt values beyond what has been seen in the original data, we can generate speech with high vocal effort levels, thus improving the intelligibility of speech in the presence of masking noise. We evaluate the intelligibility and quality of normal speech and speech with increased vocal effort in the presence of various…
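The measurement and normalization steps described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: spectral tilt is approximated by the first-order linear-prediction coefficient (the lag-1 normalized autocorrelation), and the function names `spectral_tilt` and `normalize`, the frame sizes, and the [-1, 1] target range are all hypothetical choices.

```python
import numpy as np

def spectral_tilt(signal, frame_len=400, hop=160):
    """Rough per-utterance tilt estimate (an assumed measure, not the paper's).

    Averages the lag-1 normalized autocorrelation r1/r0 over windowed frames.
    Values near 1 indicate a steep tilt (energy concentrated at low
    frequencies); smaller values indicate a flatter spectrum.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    tilts = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        r0 = np.dot(frame, frame)
        if r0 < 1e-10:
            continue  # skip (near-)silent frames
        r1 = np.dot(frame[:-1], frame[1:])
        tilts.append(r1 / r0)
    return float(np.mean(tilts)) if tilts else 0.0

def normalize(values, lo=None, hi=None):
    """Map values linearly so that [lo, hi] becomes [-1, 1].

    lo and hi default to the min and max of `values`. Passing the range
    observed in the training data lets unseen values land outside [-1, 1].
    """
    values = np.asarray(values, dtype=float)
    lo = values.min() if lo is None else lo
    hi = values.max() if hi is None else hi
    return 2.0 * (values - lo) / (hi - lo) - 1.0

# Toy check with synthetic tones at an assumed 16 kHz sampling rate.
sr = 16000
t = np.arange(sr) / sr
low_tone = np.sin(2 * np.pi * 200 * t)    # low-frequency energy: steep tilt
high_tone = np.sin(2 * np.pi * 4000 * t)  # high-frequency energy: flat tilt
tilt_low = spectral_tilt(low_tone)
tilt_high = spectral_tilt(high_tone)
```

Under this assumed measure, a flatter spectrum yields a smaller coefficient, so a conditioning value for higher vocal effort would sit toward the low end of the normalized range. Calling `normalize` with `lo` and `hi` fixed to the range observed in the data, and then feeding a value outside that range, gives a conditioning input outside [-1, 1], which is one way to picture the extrapolation the abstract describes.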
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Acoustic Wave Phenomena Research
