End-to-End Multi-Task Denoising for joint SDR and PESQ Optimization
Jaeyoung Kim, Mostafa El-Khamy, Jungwon Lee

TL;DR
This paper introduces an end-to-end speech denoising method that operates in the time domain and uses dedicated loss functions for SDR and PESQ, so that both target metrics are optimized jointly instead of through a mean square error proxy.
Contribution
It proposes a novel end-to-end framework that addresses spectrum and metric mismatches by optimizing directly in the time domain with new loss functions for SDR and PESQ.
Findings
Significant SDR and PESQ improvements over existing methods.
Effective mitigation of spectrum and metric mismatches.
Enhanced speech quality in denoising tasks.
Abstract
Supervised learning based on deep neural networks has recently achieved substantial improvements in speech enhancement. Denoising networks learn a mapping from noisy speech either directly to clean speech or to a spectrum mask, which is the ratio between the clean and noisy spectra. In either case, the network is optimized by minimizing the mean square error (MSE) between the ground-truth labels and the time-domain or spectrum output. However, existing schemes suffer from one of two critical issues: spectrum mismatch and metric mismatch. The spectrum mismatch is the well-known problem that a spectrum modified after the short-time Fourier transform (STFT) generally cannot be fully recovered after the inverse short-time Fourier transform (ISTFT). The metric mismatch is that the conventional MSE criterion is sub-optimal for maximizing our target metrics, signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ).…
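The metric mismatch described in the abstract can be illustrated with a small sketch. It assumes the common energy-ratio definition of SDR, 10·log10(‖s‖² / ‖s − ŝ‖²), which may differ from the exact loss used in the paper; the function names `mse` and `sdr_db` and the toy signals are hypothetical and chosen only for illustration.

```python
import math


def mse(clean, estimate):
    """Mean square error between two equal-length sample sequences."""
    return sum((c - e) ** 2 for c, e in zip(clean, estimate)) / len(clean)


def sdr_db(clean, estimate, eps=1e-12):
    """Energy-ratio SDR in dB (assumed definition, not taken from the paper)."""
    signal_energy = sum(c * c for c in clean)
    error_energy = sum((c - e) ** 2 for c, e in zip(clean, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


# Toy data (hypothetical): a loud and a quiet copy of the same sine wave,
# each corrupted by the same small additive error.
n = 800
loud = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
quiet = [0.1 * x for x in loud]
error = [0.01 * math.cos(2 * math.pi * 50 * t / n) for t in range(n)]

loud_est = [x + e for x, e in zip(loud, error)]
quiet_est = [x + e for x, e in zip(quiet, error)]

mse_loud, mse_quiet = mse(loud, loud_est), mse(quiet, quiet_est)
sdr_loud, sdr_quiet = sdr_db(loud, loud_est), sdr_db(quiet, quiet_est)

print(f"loud : MSE={mse_loud:.2e}  SDR={sdr_loud:.1f} dB")
print(f"quiet: MSE={mse_quiet:.2e}  SDR={sdr_quiet:.1f} dB")
```

Both estimates have the same MSE, yet the quiet utterance scores 20 dB lower in SDR because the same error energy is measured against a signal with 100 times less energy. This is one way in which minimizing MSE does not necessarily maximize SDR; PESQ, a perceptual measure, is not covered by this sketch.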
Taxonomy
Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Music and Audio Processing
