FlowW2N: Whispered-to-Normal Speech Conversion via Flow-Matching

Fabian Ritter-Gutierrez; Md Asif Jalal; Pablo Peso Parada; Karthikeyan Saravanan; Yusun Shul; Minseung Kim; Gun-Woo Lee; Han-Gil Moon

arXiv:2603.04296 · eess.AS · March 5, 2026

Open Access

TL;DR

FlowW2N introduces a flow-matching method for converting whispered speech to normal speech, leveraging synthetic paired data and invariant features to achieve state-of-the-art results without real paired training data.

Contribution

The paper presents a novel flow-based approach that trains on synthetic data and uses domain-invariant features for effective whispered-to-normal speech conversion.

Findings

01. Achieves state-of-the-art intelligibility on the CHAINS and wTIMIT datasets.

02. Reduces Word Error Rate by 26-46% relative to prior methods.

03. Requires only 10 inference steps and no real paired data.
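
The 26-46% figure is a relative reduction, that is, a ratio of error rates rather than a difference in percentage points. A minimal sketch of the arithmetic follows. The helper name `relative_wer_reduction` and the WER values are hypothetical illustrations and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer.

    Both arguments are WERs in the same unit (for example, percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Made-up numbers: a baseline WER of 20.0% lowered to 14.8%
    # is an absolute drop of 5.2 points but a relative drop of about 26%.
    print(f"{relative_wer_reduction(20.0, 14.8):.0%}")
```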

Abstract

Whispered-to-normal (W2N) speech conversion aims to reconstruct missing phonation from whispered input while preserving content and speaker identity. This task is challenging due to temporal misalignment between whispered and voiced recordings and the lack of paired data. We propose FlowW2N, a conditional flow matching approach that trains exclusively on synthetic, time-aligned whisper-normal pairs and conditions on domain-invariant features. We exploit high-level ASR embeddings that exhibit strong invariance between synthetic and real whispered speech, enabling generalization to real whispers despite never observing them during training. We verify this invariance across ASR layers and propose a selection criterion optimizing content informativeness and cross-domain invariance. Our method achieves SOTA intelligibility on the CHAINS and wTIMIT datasets, reducing Word Error Rate by 26-46%…
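
To make the flow-matching idea concrete, here is a minimal NumPy sketch of a conditional flow matching objective and a fixed-step Euler sampler. It rests on assumptions the abstract does not state: a straight-line probability path between noise and target features, a plain Euler integrator, and a generic `velocity_fn(x, t, cond)` callable standing in for the trained network. All function names are hypothetical, and `cond` merely plays the role of the ASR-embedding conditioning; this is not the authors' implementation.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 (t=0) to target x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Velocity of the straight path, constant in t (assumed path choice)."""
    return x1 - x0


def cfm_loss(velocity_fn, x0, x1, t, cond):
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(x0, x1, t)
    pred = velocity_fn(x_t, t, cond)
    return float(np.mean((pred - target_velocity(x0, x1)) ** 2))


def euler_sample(velocity_fn, x0, cond, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # velocity is evaluated at the start of each step
        x = x + dt * velocity_fn(x, t, cond)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.standard_normal((4, 8))  # noise sample
    x1 = rng.standard_normal((4, 8))  # stand-in for target speech features

    # Toy "oracle" field that already knows the target through cond;
    # a real model would be a trained network.
    def oracle(x, t, cond):
        return (cond - x) / (1.0 - t)

    print(cfm_loss(oracle, x0, x1, t=0.3, cond=x1))
    print(np.abs(euler_sample(oracle, x0, cond=x1) - x1).max())
```

With the default `num_steps=10` the sampler calls the velocity function ten times, which mirrors the 10-step inference budget the paper reports. Time alignment of the training pairs matters here because `x1` and `cond` are compared frame by frame.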

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders