Speech Denoising in the Waveform Domain with Self-Attention
Zhifeng Kong, Wei Ping, Ambrish Dantrey, Bryan Catanzaro

TL;DR
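This paper introduces CleanUNet, a causal speech denoising model that operates directly on the raw waveform. It combines an encoder-decoder architecture with self-attention blocks at the bottleneck and is trained with losses over both the waveform and multi-resolution spectrograms. The authors report better denoised speech quality than prior state-of-the-art models on objective and subjective metrics.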
Contribution
The paper presents a causal encoder-decoder model for waveform-domain speech denoising that uses self-attention blocks to refine its bottleneck representations, and reports higher denoised speech quality than prior models.
Findings
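Outperforms state-of-the-art denoising models on objective and subjective speech quality metrics
Uses self-attention blocks to refine the bottleneck representations, which the authors describe as crucial for good results
Is optimized with losses defined over both the waveform and multi-resolution spectrograms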
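To illustrate how self-attention can stay causal, here is a minimal sketch in plain Python. It assumes causality is enforced by letting each bottleneck frame attend only to itself and earlier frames. The function name `causal_self_attention` is hypothetical, and the sketch uses a single head with no learned projections, so it is not the released CleanUNet implementation.

```python
import math


def causal_self_attention(x):
    """Single-head scaled dot-product self-attention with a causal mask.

    x: list of T frames, each a list of d floats.
    Assumption for brevity: queries, keys and values are all x itself
    (no learned projection matrices).
    """
    T, d = len(x), len(x[0])
    out = []
    for t in range(T):
        # Frame t only sees frames 0..t, so there is no look-ahead.
        scores = [
            sum(q * k for q, k in zip(x[t], x[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible frames.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        # Output is the attention-weighted average of the visible frames.
        out.append([
            sum(w * x[s][j] for s, w in enumerate(weights))
            for j in range(d)
        ])
    return out


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
refined = causal_self_attention(frames)
```

Because of the mask, changing a later frame leaves the outputs for all earlier frames unchanged, which is the property a causal (streaming) denoiser needs.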
Abstract
In this work, we present CleanUNet, a causal speech denoising model on the raw waveform. The proposed model is based on an encoder-decoder architecture combined with several self-attention blocks to refine its bottleneck representations, which is crucial to obtain good results. The model is optimized through a set of losses defined over both waveform and multi-resolution spectrograms. The proposed method outperforms the state-of-the-art models in terms of denoised speech quality from various objective and subjective evaluation metrics. We release our code and models at https://github.com/nvidia/cleanunet.
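The combination of a waveform loss with multi-resolution spectrogram losses can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the spectrogram term here is a commonly used form (spectral convergence plus log-magnitude L1), and the function names, the `resolutions` values, and the equal weighting are all hypothetical choices. A naive DFT keeps the example dependency-free.

```python
import cmath
import math


def stft_mag(x, n_fft, hop):
    """Magnitude spectrogram via a naive DFT with a Hann window."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        seg = [x[start + n] * window[n] for n in range(n_fft)]
        frames.append([
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n in range(n_fft)))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


def stft_loss(clean, denoised, n_fft, hop, eps=1e-7):
    """Spectral convergence plus mean log-magnitude L1 at one resolution.

    Both signals must be at least n_fft samples long.
    """
    s_c = stft_mag(clean, n_fft, hop)
    s_d = stft_mag(denoised, n_fft, hop)
    pairs = [(c, d) for fc, fd in zip(s_c, s_d) for c, d in zip(fc, fd)]
    num = math.sqrt(sum((c - d) ** 2 for c, d in pairs))
    den = math.sqrt(sum(c ** 2 for c, _ in pairs)) + eps
    log_l1 = sum(abs(math.log(c + eps) - math.log(d + eps))
                 for c, d in pairs) / len(pairs)
    return num / den + log_l1


def total_loss(clean, denoised, resolutions=((32, 8), (64, 16), (128, 32))):
    """Waveform L1 plus the STFT loss averaged over several (n_fft, hop) pairs."""
    wave_l1 = sum(abs(c - d) for c, d in zip(clean, denoised)) / len(clean)
    spec = sum(stft_loss(clean, denoised, n, h)
               for n, h in resolutions) / len(resolutions)
    return wave_l1 + spec


clean = [math.sin(2 * math.pi * 0.05 * n) for n in range(256)]
noisy = [c + 0.1 * math.sin(2 * math.pi * 0.31 * n) for n, c in enumerate(clean)]
loss = total_loss(clean, noisy)
```

Using several window sizes lets the loss penalize errors at both fine time resolution (short windows) and fine frequency resolution (long windows), which a single spectrogram cannot do at once.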
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques
