FINALLY: fast and universal speech enhancement with studio-like quality

Nicholas Babaev; Kirill Tamogashev; Azat Saginbaev; Ivan Shchekotov; Hanbin Bae; Hosang Sung; WonJun Lee; Hoon-Young Cho; Pavel Andreev

arXiv:2410.05920 · cs.SD · November 1, 2024 · 3 cites

TL;DR

This paper introduces FINALLY, a speech enhancement model that combines GANs with perceptual loss and a WavLM encoder, achieving real-time, high-quality speech restoration with state-of-the-art results.

Contribution

The paper presents a novel training pipeline integrating WavLM perceptual loss with GANs, improving stability and quality in speech enhancement.

Findings

01. Achieves state-of-the-art speech enhancement quality at 48 kHz

02. Demonstrates stability of GAN training with perceptual loss

03. Produces clear, studio-like speech from distorted recordings

Abstract

In this paper, we address the challenge of speech enhancement in real-world recordings, which often contain various forms of distortion, such as background noise, reverberation, and microphone artifacts. We revisit the use of Generative Adversarial Networks (GANs) for speech enhancement and theoretically show that GANs are naturally inclined to seek the point of maximum density within the conditional clean speech distribution, which, as we argue, is essential for the speech enhancement task. We study various feature extractors for perceptual loss to facilitate the stability of adversarial training, developing a methodology for probing the structure of the feature space. This leads us to integrate WavLM-based perceptual loss into the MS-STFT adversarial training pipeline, creating an effective and stable training procedure for the speech enhancement model. The resulting speech enhancement…
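As a rough illustration of the idea described in the abstract, the sketch below combines a feature-space perceptual loss with an adversarial term summed over several discriminators. It is not the authors' implementation: the function names, the loss weights, the least-squares form of the adversarial term, and the toy feature extractor and discriminators are all assumptions made for this example. In a real system the extractor would be a pretrained encoder such as WavLM and the discriminators would operate on STFTs at multiple resolutions.

```python
# Illustrative sketch only. Every name, weight, and loss form here is a
# hypothetical stand-in, not taken from the FINALLY paper or its code.
from typing import Callable, Sequence

Vector = Sequence[float]


def perceptual_loss(extract: Callable[[Vector], Vector],
                    enhanced: Vector, clean: Vector) -> float:
    """Mean squared distance between the feature vectors of two waveforms."""
    f_enh, f_ref = extract(enhanced), extract(clean)
    return sum((a - b) ** 2 for a, b in zip(f_enh, f_ref)) / len(f_ref)


def lsgan_generator_loss(disc_scores: Sequence[float]) -> float:
    """Least-squares generator term (assumed): push each score toward 1."""
    return sum((s - 1.0) ** 2 for s in disc_scores) / len(disc_scores)


def generator_objective(extract: Callable[[Vector], Vector],
                        discriminators: Sequence[Callable[[Vector], float]],
                        enhanced: Vector, clean: Vector,
                        perceptual_weight: float = 1.0,
                        adversarial_weight: float = 1.0) -> float:
    """Weighted sum of the perceptual term and the adversarial term."""
    scores = [d(enhanced) for d in discriminators]
    return (perceptual_weight * perceptual_loss(extract, enhanced, clean)
            + adversarial_weight * lsgan_generator_loss(scores))


def toy_features(x: Vector) -> Vector:
    """Toy stand-in for a pretrained encoder: energy of 2-sample frames."""
    return [sum(v * v for v in x[i:i + 2]) for i in range(0, len(x), 2)]


# Toy stand-ins for multi-resolution discriminators (constant scores).
toy_discriminators = [lambda x: 0.5, lambda x: 1.0]

clean = [0.0, 1.0, 0.0, -1.0]
enhanced = [0.0, 1.0, 0.0, -1.0]   # identical to clean, so perceptual term is 0
loss = generator_objective(toy_features, toy_discriminators, enhanced, clean)
print(loss)  # 0.125: only the adversarial term contributes here
```

With identical enhanced and clean signals, the perceptual term vanishes and the remaining 0.125 comes entirely from the first toy discriminator scoring 0.5 instead of 1. This mirrors the division of labor suggested in the abstract: the perceptual term anchors the output to the reference, while the adversarial term shapes what remains.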

Videos

FINALLY: fast and universal speech enhancement with studio-like quality · SlidesLive

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face Recognition and Analysis