Whispered and Lombard Neural Speech Synthesis
Qiong Hu, Tobias Bleisch, Petko Petkov, Tuomo Raitio, Erik Marchi, Varun Lakshminarasimhan

TL;DR
This paper compares methods for generating normal, Lombard, and whisper speech from limited data, using pre-training and fine-tuning, signal processing, and style embeddings. It reports high-quality synthesis and improved intelligibility.
Contribution
It introduces a comprehensive comparison of approaches for multi-style speech synthesis, including a novel use of a speaker verification model as a style encoder.
Findings
High-quality speech generation via pre-training and fine-tuning.
Lombard and whisper styles can be effectively synthesized with limited data.
Lombard speech improves intelligibility significantly.
Abstract
It is desirable for a text-to-speech system to take into account the environment where synthetic speech is presented, and provide appropriate context-dependent output to the user. In this paper, we present and compare various approaches for generating different speaking styles, namely, normal, Lombard, and whisper speech, using only limited data. The following systems are proposed and assessed: 1) Pre-training and fine-tuning a model for each style. 2) Lombard and whisper speech conversion through a signal processing based approach. 3) Multi-style generation using a single model based on a speaker verification model. Our mean opinion score and AB preference listening tests show that 1) we can generate high quality speech through the pre-training/fine-tuning approach for all speaking styles. 2) Although our speaker verification (SV) model is not explicitly trained to discriminate…
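The abstract reports mean opinion score (MOS) results without showing how such a score is summarized. As a minimal sketch, assuming listener ratings on a 1 to 5 scale and a normal-approximation 95% confidence interval, the per-style MOS could be computed as below. The function name, the `z` parameter, and all ratings are hypothetical and invented for illustration; they are not the paper's data or evaluation code.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mos, half_width) for a list of listener ratings.

    half_width is the z * standard-error half-width of the confidence
    interval (z=1.96 gives roughly 95% under a normal approximation).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = statistics.mean(ratings)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, z * sem


# Hypothetical ratings, made up for this example only.
ratings_by_style = {
    "normal": [4, 5, 4, 4, 5, 4],
    "lombard": [4, 4, 3, 4, 5, 4],
    "whisper": [3, 4, 4, 3, 4, 4],
}

for style, ratings in ratings_by_style.items():
    mos, half_width = mean_opinion_score(ratings)
    print(f"{style}: MOS = {mos:.2f} +/- {half_width:.2f}")
```

With only a handful of ratings per style the normal approximation is rough; a real listening test would typically collect many more ratings per condition.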
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Sigmoid Activation · Highway Layer · Tanh Activation · Convolution · Dropout · Highway Network · Residual GRU · Max Pooling · Dense Connections
