WhisperVC: Decoupled Cross-Domain Alignment and Speech Generation for Low-Resource Whisper-to-Normal Conversion

Dong Liu; Juan Liu; Wei Ju; Yao Tian; Ming Li

arXiv:2511.01056 · eess.AS · March 11, 2026

TL;DR

WhisperVC is a three-stage framework for low-resource whisper-to-normal speech conversion that decouples cross-domain alignment from speech generation, reaching DNSMOS 3.07, UTMOS 2.83, and CER 16.93% on AISHELL6-Whisper with limited paired data.

Contribution

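The paper introduces WhisperVC, a decoupled three-stage approach to whisper-to-normal conversion: Stage 1 learns cross-domain alignment from limited paired whisper-normal data, Stage 2 is trained on normal speech alone, and the converted speech preserves speaker identity (WavLM similarity 0.95).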

Findings

01 Achieves competitive speech quality metrics (DNSMOS 3.07, UTMOS 2.83)

02 Maintains high speaker similarity (WavLM score 0.95)

03 Supports privacy and rehabilitation applications
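
A speaker similarity score like the one in finding 02 is commonly reported as the cosine similarity between speaker embeddings of the converted and the reference utterance. This page does not state the exact procedure, so the Python sketch below is an assumption: `cosine_similarity` is a hypothetical helper, and the toy vectors stand in for embeddings that a WavLM-based speaker model would produce.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (range -1 to 1)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return float(np.dot(a, b) / denom)


# Toy stand-ins for speaker embeddings (assumed, not real model outputs).
reference_embedding = np.array([0.2, 0.9, 0.4])
converted_embedding = np.array([0.25, 0.85, 0.45])
score = cosine_similarity(reference_embedding, converted_embedding)
print(f"speaker similarity: {score:.3f}")
```

Scores near 1 indicate that the two embeddings point in almost the same direction, which is the usual reading of a high speaker similarity.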

Abstract

Whispered speech lacks vocal-fold excitation, making intelligible conversion challenging. We propose WhisperVC, a three-stage framework for low-resource whisper-to-normal (W2N) conversion that decouples cross-domain alignment from speech generation. Stage 1 uses limited paired whisper-normal data with a content encoder and a Conformer-based variational autoencoder (VAE) with soft-DTW alignment to learn domain-invariant semantic representations. Stage 2, trained only on normal speech, employs a Length-Channel Aligner and a two-stage speaker-conditioned mel generator for timbre and prosody modeling. Stage 3 fine-tunes a HiFi-GAN vocoder for waveform synthesis. Experimental results on AISHELL6-Whisper show competitive quality (DNSMOS 3.07, UTMOS 2.83, CER 16.93%) and WavLM speaker similarity (0.95). The framework also supports privacy-preserving communication as well as non-vocal…
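
The soft-DTW alignment mentioned for Stage 1 can be illustrated in isolation. The sketch below is a minimal NumPy version of the standard soft-DTW recursion (a smoothed minimum over the three DTW predecessors), not the authors' implementation. The function names, the squared Euclidean frame cost, and the `gamma` values are assumptions chosen for illustration.

```python
import numpy as np


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    scaled = np.asarray(values, dtype=float) / -gamma
    peak = np.max(scaled)
    return -gamma * (peak + np.log(np.sum(np.exp(scaled - peak))))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW discrepancy between two sequences of feature frames.

    x has shape (n, d) and y has shape (m, d); n and m may differ,
    which is the case for whispered and normal versions of an utterance.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    # Pairwise squared Euclidean cost between frames (assumed cost function).
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    # Accumulated cost table with an infinite border, as in classic DTW.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            predecessors = [acc[i - 1, j], acc[i - 1, j - 1], acc[i, j - 1]]
            acc[i, j] = cost[i - 1, j - 1] + soft_min(predecessors, gamma)
    return acc[n, m]


# Toy sequences of different lengths standing in for latent frames.
whisper_frames = np.array([[0.0], [1.0], [2.0]])
normal_frames = np.array([[0.0], [1.0], [1.0], [2.0]])
print(soft_dtw(whisper_frames, normal_frames, gamma=0.01))
```

As `gamma` approaches zero the value approaches the classic DTW cost, while larger values give a smoother objective. That smoothness is what makes soft-DTW usable as a differentiable training loss.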

Code & Models

Datasets

SMIIP-lab/AISHELL6-Whisper
dataset · 27 downloads

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders