TL;DR
This paper tests deep neural network features as a 'perceptual' loss function for speech denoising. Transforms trained this way remove noise better than baselines trained to reconstruct clean waveforms, but do not surpass simpler filter bank-based losses.
Contribution
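It trains audio transforms for speech enhancement with deep feature-based perceptual losses, taken from networks trained to classify spoken words or environmental sounds, and compares them with waveform reconstruction baselines, previous deep feature losses, and simpler filter bank-based losses.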
Findings
Deep feature losses improve noise removal over baseline waveform reconstruction methods.
Using deep features does not outperform simpler filter bank-based losses.
Deep features can guide speech enhancement but are not yet superior to non-learned alternatives.
Abstract
Contemporary speech enhancement predominantly relies on audio transforms that are trained to reconstruct a clean speech waveform. The development of high-performing neural network sound recognition systems has raised the possibility of using deep feature representations as 'perceptual' losses with which to train denoising systems. We explored their utility by first training deep neural networks to classify either spoken words or environmental sounds from audio. We then trained an audio transform to map noisy speech to an audio waveform that minimized the difference in the deep feature representations between the output audio and the corresponding clean audio. The resulting transforms removed noise substantially better than baseline methods trained to reconstruct clean waveforms, and also outperformed previous methods using deep feature losses. However, a similar benefit was obtained…
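The deep feature loss described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the "recognition network" is a small stack of frozen random 1-D convolutions (the paper uses networks trained on word or environmental sound classification), the distance is a mean absolute difference summed over layers (the abstract does not specify the distance), and the names `conv1d_relu`, `deep_features`, and `deep_feature_loss` are hypothetical. The sketch only computes the loss; training the audio transform to minimize it is not shown.

```python
import numpy as np

def conv1d_relu(x, w):
    """Apply a 1-D convolution layer followed by ReLU.
    x: (in_ch, T) activations, w: (out_ch, in_ch, k) filters.
    Returns (out_ch, T - k + 1) activations."""
    out_ch, in_ch, k = w.shape
    out = np.zeros((out_ch, x.shape[1] - k + 1))
    for o in range(out_ch):
        for i in range(in_ch):
            out[o] += np.correlate(x[i], w[o, i], mode="valid")
    return np.maximum(out, 0.0)

def deep_features(wave, weights):
    """Run a waveform through the frozen network and collect
    the activations of every layer."""
    x = wave[None, :]  # add a channel axis: (1, T)
    feats = []
    for w in weights:
        x = conv1d_relu(x, w)
        feats.append(x)
    return feats

def deep_feature_loss(output_wave, clean_wave, weights):
    """Sum over layers of the mean absolute difference between the
    features of the transform output and those of the clean audio.
    (L1 distance is an assumption of this sketch.)"""
    f_out = deep_features(output_wave, weights)
    f_clean = deep_features(clean_wave, weights)
    return sum(np.mean(np.abs(a - b)) for a, b in zip(f_out, f_clean))

# Stand-in for a pretrained recognition network: frozen random filters.
rng = np.random.default_rng(0)
weights = [
    0.1 * rng.standard_normal((4, 1, 9)),  # layer 1: 1 -> 4 channels
    0.1 * rng.standard_normal((4, 4, 9)),  # layer 2: 4 -> 4 channels
]

# Toy "clean speech" (a sine wave) and a noisy version of it.
t = np.linspace(0.0, 1.0, 400)
clean = np.sin(2 * np.pi * 5 * t)
noisy = clean + 0.3 * rng.standard_normal(t.shape)

loss_noisy = deep_feature_loss(noisy, clean, weights)
loss_clean = deep_feature_loss(clean, clean, weights)
```

In this toy setup the loss is zero when the output equals the clean signal and positive for the noisy signal. A denoising transform would be trained to reduce this value on its outputs while the recognition network stays fixed.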
