Ring Mixing with Auxiliary Signal-to-Consistency-Error Ratio Loss for Unsupervised Denoising in Speech Separation
Matthew Maciejewski, Samuele Cornell

TL;DR
This paper introduces ring mixing and a new SCER loss to improve unsupervised speech denoising, enabling models to better generalize to real-world noisy speech without clean references.
Contribution
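The paper proposes ring mixing, a batch strategy in which each source appears in two mixtures, together with an auxiliary SCER loss that penalizes inconsistent estimates of the same source across those mixtures. The combination breaks the symmetry of the training loss and pushes the separator toward denoised estimates.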
Findings
Reduces residual noise by upwards of half on a WHAM!-based benchmark.
Enables training of denoising systems using only noisy in-the-wild data.
Opens the door to training systems that generalize better to real-world noisy speech.
Abstract
Noisy speech separation systems are typically trained on fully-synthetic mixtures, limiting generalization to real-world scenarios. Though training on mixtures of in-domain (thus often noisy) speech is possible, we show that this leads to undesirable optima where mixture noise is retained in the estimates, due to the inseparability of the background noises and the loss function's symmetry. To address this, we propose ring mixing, a batch strategy of using each source in two mixtures, alongside a new Signal-to-Consistency-Error Ratio (SCER) auxiliary loss penalizing inconsistent estimates of the same source from different mixtures, breaking symmetry and incentivizing denoising. On a WHAM!-based benchmark, our method can reduce residual noise by upwards of half, effectively learning to denoise from only noisy recordings. This opens the door to training more generalizable systems using…
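The abstract gives no formulas, so the following is only a rough sketch of the two ideas, not the authors' implementation. It assumes two-source mixtures formed by adding each source to its neighbour around a ring, and it assumes an SCER of the form 10·log10(energy of the mean of the two estimates / energy of their difference). The function names `ring_mix` and `scer_db`, the choice of the mean estimate as the "signal" term, and the `eps` stabilizer are all hypothetical choices made for illustration.

```python
import numpy as np


def ring_mix(sources):
    """Assumed ring mixing: mixture i = source i + source (i + 1) mod N.

    sources: array of shape (N, T), one (possibly noisy) recording per row.
    Returns the mixtures, shape (N, T), and an (N, 2) array giving the two
    source indices that make up each mixture.
    """
    sources = np.asarray(sources, dtype=float)
    n = sources.shape[0]
    if n < 3:
        # With only two sources the two ring mixtures would be identical.
        raise ValueError("ring mixing needs at least 3 sources")
    nxt = (np.arange(n) + 1) % n
    mixtures = sources + sources[nxt]
    pairs = np.stack([np.arange(n), nxt], axis=1)
    return mixtures, pairs


def scer_db(est_a, est_b, eps=1e-8):
    """Hypothetical signal-to-consistency-error ratio in dB.

    est_a and est_b are two estimates of the same source, each obtained
    from a different mixture. The "signal" is taken to be their mean and
    the "consistency error" their difference (both are assumptions).
    """
    est_a = np.asarray(est_a, dtype=float)
    est_b = np.asarray(est_b, dtype=float)
    signal_energy = np.sum((0.5 * (est_a + est_b)) ** 2, axis=-1)
    error_energy = np.sum((est_a - est_b) ** 2, axis=-1)
    return 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))


def scer_loss(est_a, est_b):
    """Auxiliary loss under the same assumptions: negative mean SCER."""
    return -np.mean(scer_db(est_a, est_b))
```

Under these assumptions every source index appears in exactly two mixtures, so a separator produces two estimates per source. If those estimates carry different residuals (for example, noise leaked from the other source in each mixture), the difference term grows and the SCER drops, which is the kind of inconsistency the auxiliary loss is described as penalizing.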
