WhispSynth: Scaling Multilingual Whisper Corpus through Real Data Curation and A Novel Pitch-free Generative Framework

Tianyi Tan; Jiaxin Ye; Yuanming Zhang; Xiaohuai Le; Xianjun Xia; Chuanzeng Huang; Jing Lu

arXiv:2603.14853 · cs.SD · March 17, 2026

TL;DR

WhispSynth is a large-scale, high-fidelity multilingual whispered speech corpus created through a novel generative framework combining DDSP-based pitch-free methods with TTS models, enabling improved whisper synthesis research.

Contribution

The paper introduces WhispSynth, a new high-quality whispered speech dataset generated with a novel pipeline that preserves vocal timbre and linguistic content, advancing multilingual whisper synthesis.

Findings

01. WhispSynth outperforms existing whispered speech corpora in quality.

02. CosyWhisper achieves naturalness comparable to real whispered speech.

03. The framework effectively combines DDSP and TTS for high-fidelity whisper data.

Abstract

Whisper generation is constrained by the difficulty of data collection. Because whispered speech has low acoustic amplitude, high-fidelity recording is challenging. In this paper, we introduce WhispSynth, a large-scale multilingual corpus constructed via a novel high-fidelity generative framework. Specifically, we propose a pipeline that integrates a Differentiable Digital Signal Processing (DDSP)-based pitch-free method with Text-to-Speech (TTS) models. This framework refines a comprehensive collection of resources, including our newly constructed WhispNJU dataset, into 118 hours of high-fidelity whispered speech from 479 speakers. Unlike standard synthetic or noisy real data, our data engine faithfully preserves source vocal timbre and linguistic content while ensuring acoustic consistency, providing a robust foundation for text-to-whisper research. Experimental results demonstrate that…
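To make the phrase "DDSP-based pitch-free method" more concrete, the sketch below shows the general idea of a filtered-noise synthesizer: white noise is shaped frame by frame by a time-varying spectral envelope, with no harmonic (pitched) component. This is only an illustration under stated assumptions, not the authors' implementation; the abstract gives no architectural details. The function name `filtered_noise`, the `frame_size` parameter, and the band layout are all hypothetical, and the per-frame FFT filtering used here is a simplification of the windowed, overlap-added filtering that DDSP-style systems typically use.

```python
import numpy as np

def filtered_noise(magnitudes, frame_size=256, seed=0):
    """Shape white noise with per-frame filter gains (no pitched component).

    magnitudes: array of shape (n_frames, n_bands) holding non-negative
        gains spread evenly from 0 Hz to the Nyquist frequency.
    Returns a 1-D signal of length n_frames * frame_size.

    Hypothetical helper for illustration only; frames are filtered
    independently, without windowing or overlap-add.
    """
    rng = np.random.default_rng(seed)
    n_frames, n_bands = magnitudes.shape
    n_bins = frame_size // 2 + 1  # number of rfft bins per frame

    # Normalised frequency positions of the coarse bands and the FFT bins.
    band_pos = np.linspace(0.0, 1.0, n_bands)
    bin_pos = np.linspace(0.0, 1.0, n_bins)

    out = np.zeros(n_frames * frame_size)
    for i in range(n_frames):
        # Aperiodic excitation: uniform white noise instead of harmonics.
        noise = rng.uniform(-1.0, 1.0, frame_size)
        # Interpolate the coarse band gains onto every FFT bin.
        gains = np.interp(bin_pos, band_pos, magnitudes[i])
        # Apply the zero-phase filter in the frequency domain.
        spectrum = np.fft.rfft(noise) * gains
        out[i * frame_size:(i + 1) * frame_size] = np.fft.irfft(
            spectrum, n=frame_size
        )
    return out

# Example: a low-pass envelope held constant over four frames.
envelope = np.ones((4, 8))
envelope[:, 4:] = 0.0
signal = filtered_noise(envelope)
```

In a learned system the envelope would be predicted by a network from the source speech rather than set by hand; here it is fixed only to keep the example self-contained.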

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis