WhisperVC: Decoupled Cross-Domain Alignment and Speech Generation for Low-Resource Whisper-to-Normal Conversion
Dong Liu, Juan Liu, Wei Ju, Yao Tian, Ming Li

TL;DR
WhisperVC is a three-stage framework for low-resource whisper-to-normal speech conversion that decouples cross-domain alignment from speech generation, so that only the alignment stage depends on scarce paired whisper-normal data.
Contribution
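The paper introduces WhisperVC, a decoupled three-stage approach to whisper-to-normal conversion: a content encoder and Conformer-based VAE learn domain-invariant semantic representations from limited paired data, a speaker-conditioned mel generator trained only on normal speech models timbre and prosody, and a fine-tuned HiFi-GAN vocoder synthesizes the waveform.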
Findings
Achieves competitive speech quality and intelligibility on AISHELL6-Whisper (DNSMOS 3.07, UTMOS 2.83, CER 16.93%)
Maintains high speaker similarity (WavLM speaker similarity 0.95)
Supports privacy and rehabilitation applications
Abstract
Whispered speech lacks vocal-fold excitation, making intelligible conversion challenging. We propose WhisperVC, a three-stage framework for low-resource whisper-to-normal (W2N) conversion that decouples cross-domain alignment from speech generation. Stage 1 uses limited paired whisper-normal data with a content encoder and a Conformer-based variational autoencoder (VAE) with soft-DTW alignment to learn domain-invariant semantic representations. Stage 2, trained only on normal speech, employs a Length-Channel Aligner and a two-stage speaker-conditioned mel generator for timbre and prosody modeling. Stage 3 fine-tunes a HiFi-GAN vocoder for waveform synthesis. Experimental results on AISHELL6-Whisper show competitive quality (DNSMOS 3.07, UTMOS 2.83, CER 16.93%) and WavLM speaker similarity (0.95). The framework also supports privacy-preserving communication as well as non-vocal…
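The soft-DTW alignment mentioned for Stage 1 can be illustrated with a minimal sketch. This is the generic soft-DTW recursion written in NumPy, not the authors' implementation: the function names soft_min and soft_dtw, the squared Euclidean frame distance, and the gamma values are assumptions chosen for illustration, and the abstract does not say which features or hyperparameters WhisperVC uses.

```python
import numpy as np


def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = -np.asarray(values, dtype=float) / gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW cost between sequences x of shape (n, d) and y of shape (m, d).

    Assumption: frame distance is squared Euclidean; the paper may use another.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    # Pairwise squared Euclidean distances between all frames.
    dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    # Accumulated cost table with an extra border row and column.
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]]
            r[i, j] = dist[i - 1, j - 1] + soft_min(prev, gamma)
    return r[n, m]


# Hypothetical toy sequences: the second is a time-stretched copy of the first,
# standing in for whispered and normal frames of different lengths.
whisper_like = np.array([[0.0], [1.0], [2.0]])
normal_like = np.array([[0.0], [0.0], [1.0], [2.0], [2.0]])
cost_aligned = soft_dtw(whisper_like, normal_like, gamma=1e-3)
cost_shifted = soft_dtw(whisper_like, normal_like + 5.0, gamma=1e-3)
```

Because the hard minimum is replaced by a smoothed one, the cost is differentiable with respect to both sequences, which is what lets a soft-DTW term serve as a training loss when whispered and normal utterances differ in length. As gamma shrinks the value approaches the classical DTW cost.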
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders
