Generative Speech Foundation Model Pretraining for High-Quality Speech Extraction and Restoration

Pin-Jui Ku; Alexander H. Liu; Roman Korostik; Sung-Feng Huang; Szu-Wei Fu; Ante Jukić

arXiv:2409.16117·eess.AS·September 26, 2024

TL;DR

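This paper introduces a generative speech foundation model that operates directly on complex-valued short-time Fourier transform (STFT) coefficients, so it restores speech without a vocoder. After finetuning, it outperforms strong baselines on speech denoising, bandwidth extension, codec artifact removal, and target speaker extraction.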

Contribution

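The paper presents a generative pretraining approach for speech restoration that reconstructs the time-domain signal from complex STFT coefficients rather than through a mel-spectrogram vocoder. Finetuning the pretrained model yields better results than strong baselines on all four evaluated tasks.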

Findings

01 Outperforms strong baselines in speech denoising, bandwidth extension, and codec artifact removal.

02 Outperforms existing target speaker extraction systems, including those that use SSL-pretrained encoders such as WavLM.

03 Operates directly on complex-valued STFT coefficients, which simplifies synthesis and removes the quality upper bound imposed by a mel-spectrogram vocoder.

Abstract

This paper proposes a generative pretraining foundation model for high-quality speech restoration tasks. By directly operating on complex-valued short-time Fourier transform coefficients, our model does not rely on any vocoders for time-domain signal reconstruction. As a result, unlike the prior work SpeechFlow, our model simplifies the synthesis process and removes the quality upper bound imposed by a mel-spectrogram vocoder. The proposed method is evaluated on multiple speech restoration tasks, including speech denoising, bandwidth extension, codec artifact removal, and target speaker extraction. In all scenarios, finetuning our pretrained model results in superior performance over strong baselines. Notably, in the target speaker extraction task, our model outperforms existing systems, including those leveraging SSL-pretrained encoders like WavLM. The code and the pretrained…
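The claim that complex-valued STFT coefficients need no vocoder can be illustrated with a small round trip: because the coefficients keep both magnitude and phase, a plain inverse STFT with overlap-add recovers the waveform. The sketch below is only an illustration in NumPy, not the paper's pipeline; the function names, the Hann window, and the values n_fft=512 and hop=128 are assumptions chosen for the example, and the generative model itself is not shown.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Complex STFT; returns an array of shape (frames, n_fft // 2 + 1)."""
    # Periodic Hann window (illustrative choice, not taken from the paper).
    window = np.hanning(n_fft + 1)[:-1]
    # Pad both ends so every original sample is covered by overlapping frames.
    x = np.pad(x, (n_fft, n_fft))
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)

def istft(spec, n_fft=512, hop=128, length=None):
    """Inverse STFT by weighted overlap-add; no learned vocoder involved."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=-1) * window
    n_frames = frames.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop : i * hop + n_fft] += frames[i]
        norm[i * hop : i * hop + n_fft] += window ** 2
    # Divide by the summed squared window to undo the analysis/synthesis gain.
    out = out / np.maximum(norm, 1e-8)
    out = out[n_fft:]  # drop the leading padding added in stft()
    return out if length is None else out[:length]

# Round trip on a synthetic one-second signal at 16 kHz.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
spec = stft(x)                    # complex-valued coefficients
y = istft(spec, length=len(x))    # time-domain signal, reconstructed directly
```

A mel-spectrogram, by contrast, discards phase and compresses the frequency axis, so a separate vocoder has to estimate the missing information, and its quality caps the quality of the output.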

Code & Models

Repositories

NVIDIA/NeMo (PyTorch, official)

Models

nvidia/sr_ssl_flowmatching_16k_430m (Hugging Face)

Taxonomy

Topics: Speech Recognition and Synthesis