Voice Conversion for Whispered Speech Synthesis
Marius Cotescu, Thomas Drugman, Goeric Huybrechts, Jaime Lorenzo-Trueba, Alexis Moinet

TL;DR
This paper introduces a voice conversion approach to whisper synthesis using GMM and DNN models. The approach outperforms rule-based signal processing methods, achieves results indistinguishable from copy-synthesis of natural whisper recordings, and generalizes to unseen speakers.
Contribution
It demonstrates the effectiveness of DNN-based voice conversion for whisper synthesis and its application in Amazon Alexa's Whisper Mode, surpassing traditional signal processing techniques.
Findings
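VC techniques significantly outperform rule-based signal processing methods
Converted whispers are indistinguishable from copy-synthesis of natural whisper recordings
The DNN, trained on data from multiple speakers, generalizes to unseen speakers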
Abstract
We present an approach to synthesize whisper by applying a handcrafted signal processing recipe and Voice Conversion (VC) techniques to convert normally phonated speech to whispered speech. We investigate using Gaussian Mixture Models (GMM) and Deep Neural Networks (DNN) to model the mapping between acoustic features of normal speech and those of whispered speech. We evaluate naturalness and speaker similarity of the converted whisper on an internal corpus and on the publicly available wTIMIT corpus. We show that applying VC techniques is significantly better than using rule-based signal processing methods and it achieves results that are indistinguishable from copy-synthesis of natural whisper recordings. We investigate the ability of the DNN model to generalize on unseen speakers, when trained with data from multiple speakers. We show that excluding the target speaker from the…
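The mapping between acoustic features of normal and whispered speech described in the abstract can be illustrated with a minimal sketch of frame-wise regression. Everything in it is an assumption made for illustration: the feature dimensions, the synthetic paired frames, the one-hidden-layer network, and the names train_frame_mapper and convert are hypothetical and are not the authors' architecture or data. A real system would train on time-aligned vocoder features extracted from parallel normal and whispered recordings.

```python
import numpy as np


def train_frame_mapper(x, y, hidden=32, epochs=300, lr=0.05, seed=0):
    """Fit a one-hidden-layer MLP that maps source frames x to target frames y.

    Trained with full-batch gradient descent on mean squared error.
    Returns the learned parameters and the loss recorded at each epoch.
    """
    rng = np.random.default_rng(seed)
    d_in, d_out = x.shape[1], y.shape[1]
    w1 = rng.normal(0.0, 1.0 / np.sqrt(d_in), (d_in, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, d_out))
    b2 = np.zeros(d_out)
    losses = []
    for _ in range(epochs):
        # Forward pass
        h = np.tanh(x @ w1 + b1)
        pred = h @ w2 + b2
        err = pred - y
        losses.append(float(np.mean(err ** 2)))
        # Backward pass for the mean squared error
        g_pred = 2.0 * err / err.size
        g_w2 = h.T @ g_pred
        g_b2 = g_pred.sum(axis=0)
        g_z = (g_pred @ w2.T) * (1.0 - h ** 2)
        g_w1 = x.T @ g_z
        g_b1 = g_z.sum(axis=0)
        # Gradient descent update
        w1 -= lr * g_w1
        b1 -= lr * g_b1
        w2 -= lr * g_w2
        b2 -= lr * g_b2
    return (w1, b1, w2, b2), losses


def convert(params, x):
    """Apply the trained mapping to source frames, one frame per row."""
    w1, b1, w2, b2 = params
    return np.tanh(x @ w1 + b1) @ w2 + b2


# Synthetic stand-in for paired, time-aligned frames (assumption, not real speech):
# 500 frames, 16-dim "normal" features, 8-dim "whisper" features.
rng = np.random.default_rng(1)
normal_frames = rng.normal(size=(500, 16))
true_map = rng.normal(scale=0.3, size=(16, 8))
whisper_frames = np.tanh(normal_frames @ true_map) + 0.01 * rng.normal(size=(500, 8))

params, losses = train_frame_mapper(normal_frames, whisper_frames)
converted = convert(params, normal_frames)
print("first loss:", losses[0], "last loss:", losses[-1])
```

In a full pipeline the converted feature frames would be passed to a vocoder to produce a waveform; that step is omitted here.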
