Generative Models for Improved Naturalness, Intelligibility, and Voicing of Whispered Speech

Dominik Wagner; Sebastian P. Bayerl; Hector A. Cordourier Maruri; Tobias Bocklet

arXiv:2212.01775 · cs.SD · January 31, 2023

Open Access

TL;DR

This paper adapts two generative models, VQ-VAEs and MelGANs, to convert whispered speech into natural, intelligible, voiced speech, and reports at least a 25% relative reduction in Mel cepstral distortion over a DiscoGAN baseline.

Contribution

It conditions VQ-VAEs and MelGANs to recover voiced speech from whispered inputs by incorporating the normal target speech into their training criterion, and reports improvements in both objective and subjective quality measures.

Findings

01

At least 25% relative reduction in Mel cepstral distortion (MCD) compared to a DiscoGAN baseline

02

In subjective listening tests, the MelGAN-based system significantly improves naturalness, intelligibility, and voicing compared to the whispered input speech

03

Latent speech representation differences confirm the effectiveness of the proposed approach
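
The MCD figure in the first finding can be made concrete with a short sketch. It uses the commonly cited definition of Mel cepstral distortion, averaged over frames and skipping the 0th (energy) coefficient. This is an illustration under assumptions, not the authors' evaluation code: the function names, the choice to skip coefficient 0, and the requirement that both sequences are already time-aligned are all assumptions, and the numbers in the usage lines are made up rather than taken from the paper.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean MCD in dB between two time-aligned mel-cepstral sequences.

    Each argument is a list of frames; each frame is a list of mel-cepstral
    coefficients. Assumption: frames are already aligned one-to-one
    (for example by dynamic time warping done elsewhere).
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("sequences must be non-empty and equally long")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        if len(ref) != len(conv):
            raise ValueError("frames must have the same number of coefficients")
        # Skip index 0, which mostly carries frame energy (assumed convention).
        squared = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref_frames)


def relative_reduction(baseline, proposed):
    """Relative improvement of a lower-is-better metric, as a fraction."""
    return (baseline - proposed) / baseline


# Hypothetical values for illustration only, not results from the paper.
print(relative_reduction(8.0, 6.0))  # 0.25, i.e. a 25% relative reduction
```

A lower MCD means the converted speech is spectrally closer to the normal-speech reference, so a "25% relative reduction" corresponds to `relative_reduction(...)` returning at least 0.25.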

Abstract

This work adapts two recent architectures of generative models and evaluates their effectiveness for the conversion of whispered speech to normal speech. We incorporate the normal target speech into the training criterion of vector-quantized variational autoencoders (VQ-VAEs) and MelGANs, thereby conditioning the systems to recover voiced speech from whispered inputs. Objective and subjective quality measures indicate that both VQ-VAEs and MelGANs can be modified to perform the conversion task. We find that the proposed approaches significantly improve the Mel cepstral distortion (MCD) metric by at least 25% relative to a DiscoGAN baseline. Subjective listening tests suggest that the MelGAN-based system significantly improves naturalness, intelligibility, and voicing compared to the whispered input speech. A novel evaluation measure based on differences between latent speech…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Phonetics and Phonology Research