Frequency-Weighted Training Losses for Phoneme-Level DNN-based Speech Enhancement
Nasser-Eddine Monir, Paul Magron, Romain Serizel

TL;DR
This paper introduces frequency-weighted SDR loss functions for DNN-based speech enhancement, improving perceptual quality and phoneme preservation by emphasizing critical spectral regions during training.
Contribution
The paper proposes perceptually-informed, frequency-dependent weighting schemes for the SDR loss, aimed at improving phoneme-level speech intelligibility in multichannel enhancement models.
Findings
Frequency-weighted SDR losses improve perceptual speech quality.
Phoneme and consonant reconstruction is enhanced.
Spectral and phoneme-level analysis confirms better cue preservation.
Abstract
Recent advances in deep learning have significantly improved multichannel speech enhancement algorithms, yet conventional training loss functions such as the scale-invariant signal-to-distortion ratio (SDR) may fail to preserve fine-grained spectral cues essential for phoneme intelligibility. In this work, we propose perceptually-informed variants of the SDR loss, formulated in the time-frequency domain and modulated by frequency-dependent weighting schemes. These weights are designed to emphasize time-frequency regions where speech is prominent or where the interfering noise is particularly strong. We investigate both fixed and adaptive strategies, including ANSI band-importance weights, spectral magnitude-based weighting, and dynamic weighting based on the relative amount of speech and noise. We train the FaSNet multichannel speech enhancement model using these various losses.…
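To make the idea of a frequency-weighted SDR concrete, here is a minimal NumPy sketch. It is not the authors' formulation: the abstract above does not give the exact equations, so the weighted ratio below, the function names (`weighted_sdr_db`, `magnitude_weights`, `weighted_sdr_loss`), and the `eps` stabiliser are all assumptions made for illustration. The sketch takes complex spectrograms of shape (frequency bins, frames), weights the per-frequency signal and error energies, and returns the ratio in dB. It omits the scale-invariant projection and does not include ANSI band-importance values.

```python
import numpy as np


def weighted_sdr_db(ref_spec, est_spec, band_weights, eps=1e-8):
    """Frequency-weighted SDR in dB (illustrative, assumed formulation).

    ref_spec, est_spec: complex arrays of shape (n_freq, n_frames).
    band_weights: non-negative array of shape (n_freq,).
    """
    w = np.asarray(band_weights, dtype=float)
    w = w / (w.sum() + eps)  # normalise so the weights sum to about 1

    # Per-frequency energy of the reference and of the estimation error.
    signal_energy = np.sum(np.abs(ref_spec) ** 2, axis=1)
    error_energy = np.sum(np.abs(ref_spec - est_spec) ** 2, axis=1)

    # Weighted energies: bins with larger weights dominate the ratio.
    num = np.sum(w * signal_energy)
    den = np.sum(w * error_energy)
    return 10.0 * np.log10((num + eps) / (den + eps))


def magnitude_weights(ref_spec):
    """Hypothetical magnitude-based weights: mean reference magnitude per bin."""
    return np.mean(np.abs(ref_spec), axis=1)


def weighted_sdr_loss(ref_spec, est_spec, band_weights):
    """Training loss to minimise: the negative weighted SDR."""
    return -weighted_sdr_db(ref_spec, est_spec, band_weights)
```

With uniform weights this reduces to an ordinary spectrogram-domain SDR. Increasing the weight of a bin where the estimate is poor lowers the score, which is how such a loss would push a model to reduce errors in the emphasised regions.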
