FlowW2N: Whispered-to-Normal Speech Conversion via Flow-Matching
Fabian Ritter-Gutierrez, Md Asif Jalal, Pablo Peso Parada, Karthikeyan Saravanan, Yusun Shul, Minseung Kim, Gun-Woo Lee, Han-Gil Moon

TL;DR
FlowW2N introduces a flow-matching method for converting whispered speech to normal speech, leveraging synthetic paired data and invariant features to achieve state-of-the-art results without real paired training data.
Contribution
The paper presents a novel flow-based approach that trains on synthetic data and uses domain-invariant features for effective whispered-to-normal speech conversion.
Findings
Achieves state-of-the-art intelligibility on CHAINS and wTIMIT datasets.
Reduces Word Error Rate by 26-46% relative to prior methods.
Requires only 10 inference steps and no real paired data.
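As a small illustration of how a relative Word Error Rate reduction like the one reported above is usually computed, the sketch below uses made-up WER values. The function name and the numbers are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration only (not results from the paper):
# a baseline at 20.0% WER and a new system at 14.0% WER.
example_reduction = relative_wer_reduction(20.0, 14.0)
print(f"relative reduction: {example_reduction:.0%}")
```

Under this reading, a 26-46% reduction means the new WER is between 54% and 74% of the baseline WER, not that WER drops by 26-46 absolute points.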
Abstract
Whispered-to-normal (W2N) speech conversion aims to reconstruct missing phonation from whispered input while preserving content and speaker identity. This task is challenging due to temporal misalignment between whispered and voiced recordings and the lack of paired data. We propose FlowW2N, a conditional flow matching approach that trains exclusively on synthetic, time-aligned whisper-normal pairs and conditions on domain-invariant features. We exploit high-level ASR embeddings that exhibit strong invariance between synthetic and real whispered speech, enabling generalization to real whispers despite never observing them during training. We verify this invariance across ASR layers and propose a selection criterion optimizing content informativeness and cross-domain invariance. Our method achieves SOTA intelligibility on the CHAINS and wTIMIT datasets, reducing Word Error Rate by 26-46%…
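To make the idea of conditional flow matching with a small number of inference steps more concrete, here is a minimal NumPy sketch. It assumes a linear interpolation path between noise and target features and a fixed-step Euler solver. The summary above does not specify either choice, so both are assumptions. All function names are hypothetical, and `oracle_velocity` is a toy stand-in for a trained network, not the FlowW2N model.

```python
import numpy as np


def cfm_training_pair(x0, x1, t):
    """Point on the assumed linear path from x0 to x1, plus its regression target."""
    x_t = (1.0 - t) * x0 + t * x1   # interpolated sample at time t
    v_target = x1 - x0              # constant velocity along a straight path
    return x_t, v_target


def euler_sample(velocity_fn, x0, cond, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x


def oracle_velocity(x, t, cond):
    """Toy velocity field (assumption): here cond is the target itself,
    so the field points straight at it. A real model would be a neural
    network conditioned on features such as ASR embeddings."""
    return (cond - x) / (1.0 - t)


rng = np.random.default_rng(0)
target = rng.standard_normal((4, 8))   # toy stand-in for normal-speech features
noise = rng.standard_normal((4, 8))    # starting sample at t=0
converted = euler_sample(oracle_velocity, noise, target, num_steps=10)
print(np.allclose(converted, target))
```

With this oracle field the trajectory is a straight line, so ten Euler steps reach the target up to floating-point error. A learned velocity field would only approximate this, and the number of steps then trades speed against accuracy.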
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders
