TL;DR
This paper extends deep clustering for multi-speaker speech separation with an end-to-end signal approximation objective, raising two-speaker SDR from 6.0 dB to 10.3 dB and cutting WER from 89.1% to 30.8% in cocktail-party conditions.
Contribution
It introduces an end-to-end training framework with a new signal approximation objective that enhances deep clustering for multi-speaker separation.
Findings
SDR improved from 6.0 dB to 10.3 dB for two speakers
WER reduced from 89.1% to 30.8% with the new method
Enhanced model achieves state-of-the-art separation performance
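To make the dB figures above concrete, here is a minimal sketch of a signal-to-distortion ratio computed as a plain energy ratio between a reference signal and the estimation error. This page does not say how the paper computes SDR, so the definition below is an assumption: it is the simplest form of the metric, and standard evaluation toolkits add projection steps that are omitted here. The function name `simple_sdr_db` is hypothetical.

```python
import numpy as np

def simple_sdr_db(reference, estimate):
    """Energy-ratio SDR in dB (simplified; an assumption, not the paper's exact metric)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    error = estimate - reference            # everything that is not the target signal
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum(error ** 2)
    return 10.0 * np.log10(signal_energy / error_energy)

# Toy example: a constant offset of 0.1 on a unit-amplitude signal.
reference = np.array([1.0, -1.0, 1.0, -1.0])
estimate = reference + 0.1
sdr = simple_sdr_db(reference, estimate)
```

Under this definition, each 10 dB step corresponds to a tenfold drop in error energy relative to the target signal.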
Abstract
Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering. It was recently applied to spectrogram segmentation, yielding impressive results on speaker-independent multi-speaker separation. In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task. We first significantly improve upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, raising the signal-to-distortion ratio (SDR) to 10.3 dB from the baseline's 6.0 dB for two-speaker separation, as well as a 7.1 dB SDR improvement for three-speaker separation. We then extend the model to incorporate an enhancement layer to refine the signal…
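The abstract's "discriminatively trained embeddings" can be illustrated with the affinity-matching objective from the original deep clustering formulation, which the abstract itself does not spell out: each time-frequency bin gets an embedding vector, and training pushes the embedding affinity matrix V Vᵀ toward the ideal affinity Y Yᵀ built from one-hot speaker labels. The sketch below is a NumPy illustration under that assumption, not the authors' implementation. The function name, array shapes, and toy data are hypothetical.

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """Squared Frobenius norm ||V V^T - Y Y^T||_F^2, computed in low-rank form.

    V: (N, D) embeddings, one row per time-frequency bin.
    Y: (N, C) one-hot labels marking the dominant speaker per bin.
    Expanding the norm avoids forming the N x N affinity matrices.
    """
    VtV = V.T @ V   # (D, D)
    VtY = V.T @ Y   # (D, C)
    YtY = Y.T @ Y   # (C, C)
    return np.sum(VtV ** 2) - 2.0 * np.sum(VtY ** 2) + np.sum(YtY ** 2)

# Toy data: 6 bins, 2 speakers, 3-dimensional unit-norm embeddings.
rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1, 0, 1])
Y = np.eye(2)[labels]
V = rng.normal(size=(6, 3))
V /= np.linalg.norm(V, axis=1, keepdims=True)

loss_random = deep_clustering_loss(V, Y)
loss_perfect = deep_clustering_loss(Y, Y)  # embeddings equal to the labels
```

At inference time the learned embeddings would be clustered (for example with k-means) to recover one mask per speaker. The low-rank expansion matters in practice because N, the number of time-frequency bins, is far larger than the embedding dimension D.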
