TL;DR
This paper introduces an end-to-end speech inpainting framework using a convolutional U-Net trained with deep feature losses, effectively recovering missing or distorted speech segments and improving objective speech quality metrics.
Contribution
The novel framework employs deep feature losses from a pre-trained speechVGG model to enhance speech inpainting performance over traditional methods.
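To make the idea of a deep feature loss concrete, here is a minimal sketch in plain Python. It compares a predicted and a target spectrogram in the activation space of a fixed feature extractor instead of bin by bin. The two toy layers (`relu_time_diff`, `max_pool_freq`) are hypothetical stand-ins for speechVGG, and the choice of a mean absolute difference averaged over layers is an assumption made for illustration, not a detail confirmed by this summary.

```python
from typing import Callable, List, Sequence

Spectrogram = List[List[float]]  # indexed as [time frame][frequency bin]
Layer = Callable[[Spectrogram], Spectrogram]


def relu_time_diff(x: Spectrogram) -> Spectrogram:
    """Toy layer (hypothetical): rectified difference between consecutive frames."""
    return [[max(0.0, b - a) for a, b in zip(f0, f1)] for f0, f1 in zip(x, x[1:])]


def max_pool_freq(x: Spectrogram) -> Spectrogram:
    """Toy layer (hypothetical): max pooling over non-overlapping pairs of bins."""
    return [[max(f[i:i + 2]) for i in range(0, len(f), 2)] for f in x]


def mean_abs_diff(a: Spectrogram, b: Spectrogram) -> float:
    """Mean absolute difference between two equally shaped feature maps."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for va, vb in zip(row_a, row_b):
            total += abs(va - vb)
            count += 1
    return total / count


def deep_feature_loss(pred: Spectrogram, target: Spectrogram,
                      layers: Sequence[Layer]) -> float:
    """Average the feature-space distance over every layer of the extractor.

    Assumption: the extractor is frozen, so the same layers are applied to
    both inputs and only the inpainting network would receive gradients.
    """
    loss = 0.0
    feat_pred, feat_target = pred, target
    for layer in layers:
        feat_pred, feat_target = layer(feat_pred), layer(feat_target)
        loss += mean_abs_diff(feat_pred, feat_target)
    return loss / len(layers)


toy_extractor = [relu_time_diff, max_pool_freq]
target = [[0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0]]
silence = [[0.0] * 4 for _ in range(3)]
print(deep_feature_loss(target, target, toy_extractor))   # identical inputs
print(deep_feature_loss(silence, target, toy_extractor))  # mismatched inputs
```

In a real training setup the toy layers would be replaced by activations of the pre-trained network, and the loss would be computed with an automatic differentiation library so that it can drive the U-Net's weight updates.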
Findings
Successfully recovered up to 400 ms of missing speech segments.
Significant improvements in STOI and PESQ metrics.
Deep feature loss training outperformed conventional approaches.
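As a rough illustration of what a 400 ms gap means in a time-frequency representation, the sketch below converts a gap duration into a number of frames and zeroes those frames out. The 10 ms hop size, the function names, and the toy spectrogram are assumptions chosen for the example; the summary does not state the actual framing parameters.

```python
from typing import List, Tuple

Spectrogram = List[List[float]]  # indexed as [time frame][frequency bin]


def gap_frames(gap_ms: float, hop_ms: float) -> int:
    """Number of frames covered by a gap, given an assumed hop size."""
    return round(gap_ms / hop_ms)


def mask_gap(spec: Spectrogram, start: int,
             n_frames: int) -> Tuple[Spectrogram, List[float]]:
    """Return a copy of `spec` with a contiguous run of frames zeroed out,
    plus the binary mask used (1.0 = kept frame, 0.0 = missing frame)."""
    end = min(start + n_frames, len(spec))
    mask = [0.0 if start <= t < end else 1.0 for t in range(len(spec))]
    masked = [[value * m for value in frame] for frame, m in zip(spec, mask)]
    return masked, mask


# Hypothetical example: 1 s of audio at an assumed 10 ms hop, 4 frequency bins.
spec = [[1.0, 2.0, 3.0, 4.0] for _ in range(100)]
n = gap_frames(gap_ms=400, hop_ms=10)
masked, mask = mask_gap(spec, start=30, n_frames=n)
print(n, int(sum(mask)))  # frames removed, frames kept
```

An inpainting network would take `masked` as input (with or without `mask`) and be trained to reproduce `spec` in the missing region.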
Abstract
Transient loud intrusions, often occurring in noisy environments, can completely overpower the speech signal and lead to an inevitable loss of information. While existing algorithms for noise suppression can yield impressive results, their efficacy remains limited for very low signal-to-noise ratios or when parts of the signal are missing. To address these limitations, here we propose an end-to-end framework for speech inpainting, the context-based retrieval of missing or severely distorted parts of the time-frequency representation of speech. The framework is based on a convolutional U-Net trained via deep feature losses, obtained using speechVGG, a deep speech feature extractor pre-trained on an auxiliary word classification task. Our evaluation results demonstrate that the proposed framework can recover large portions of the missing or distorted time-frequency representation of speech, up to 400…
Taxonomy
Methods: Concatenated Skip Connection · Max Pooling · Convolution · U-Net
