SE-MelGAN -- Speaker Agnostic Rapid Speech Enhancement

Luka Chkhetiani; Levan Bejanidze

arXiv:2006.07637 · eess.AS · June 16, 2020 · 1 cites

TL;DR

This paper introduces SE-MelGAN, a robust, speaker-agnostic speech enhancement model based on GANs that generalizes well to unseen noises, improves quality and speed over previous methods, and operates in real-time without hardware optimization.

Contribution

It adapts MelGAN's robustness to speech enhancement, demonstrating multi-speaker generalization, noise robustness, and faster convergence without model modifications.

Findings

01. Outperforms SEGAN in quality and speed.

02. Operates at over 100x real-time speed on GPU.

03. Handles unseen background noises effectively.
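
The "100x real-time" figure in finding 02 is most naturally read as a speed-up ratio: seconds of audio produced per second of compute. The snippet below is a minimal sketch of that arithmetic only. The function name `real_time_factor` and the timing numbers are hypothetical illustrations, not measurements or code from the paper.

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Return seconds of audio produced per second of compute.

    A value above 1.0 means faster than real time.
    Hypothetical helper for illustration; not from the paper.
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Assumed example numbers: 20 s of enhanced audio generated in 0.125 s.
speedup = real_time_factor(audio_seconds=20.0, processing_seconds=0.125)
is_faster_than_100x = speedup > 100.0
```

Under this reading, a model clears the 100x mark whenever it needs less than 10 ms of compute per second of output audio.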

Abstract

Recent advances in Generative Adversarial Networks in the speech synthesis domain [3], [2] have shown that it is possible to train GANs [8] reliably for high-quality, coherent waveform generation from mel-spectrograms. We propose that MelGAN's [3] robustness in learning speech features can be transferred to the speech enhancement and noise reduction domain without any model modification. Our proposed method generalizes over a multi-speaker speech dataset and robustly handles unseen background noises during inference. We also show that increasing the batch size for this particular approach not only yields better speech results, but also generalizes over the multi-speaker dataset easily and leads to faster convergence. Additionally, it outperforms the previous state-of-the-art GAN approach for speech enhancement, SEGAN [5], in two respects: (1) quality and (2) speed.…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: 1x1 Convolution · Residual Connection · Tanh Activation · GAN Hinge Loss · Weight Normalization · Average Pooling · Convolution · Dilated Convolution