SLOGD: Speaker LOcation Guided Deflation approach to speech separation
Sunit Sivasankaran, Emmanuel Vincent, Dominique Fohr

TL;DR
This paper introduces SLOGD, an iterative speech separation method that estimates each speaker's location and uses it to guide mask estimation, reaching a 44.2% word error rate on a noisy, reverberant multichannel version of WSJ-2MIX.
Contribution
The paper presents a new speaker localization guided deflation approach for speech separation, demonstrating improved performance over existing methods like Conv-TasNet.
Findings
Achieves 44.2% WER on a reverberated, noisy multichannel version of the WSJ-2MIX dataset.
Provides 34% relative WER reduction over non-separated system.
Outperforms Conv-TasNet with 17% relative WER improvement.
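The relative figures above follow the usual definition of relative WER reduction, (baseline - system) / baseline. The short sketch below shows that arithmetic; the function names are illustrative and not taken from the paper. Because the published percentages are rounded, any baseline recovered this way is only approximate and should not be read as a reported result.

```python
def relative_reduction(baseline_wer: float, system_wer: float) -> float:
    """Relative WER reduction of a system with respect to a baseline."""
    return (baseline_wer - system_wer) / baseline_wer


def implied_baseline(system_wer: float, rel_reduction: float) -> float:
    """Baseline WER implied by a system WER and a relative reduction.

    Inverts relative_reduction; the result inherits any rounding
    in the inputs.
    """
    return system_wer / (1.0 - rel_reduction)


# Back-of-the-envelope estimates from the rounded figures listed above.
no_separation_estimate = implied_baseline(44.2, 0.34)
conv_tasnet_estimate = implied_baseline(44.2, 0.17)
```

With the rounded inputs, these estimates come out near 67% WER without separation and near 53% WER for Conv-TasNet.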
Abstract
Speech separation is the process of separating multiple speakers from an audio recording. In this work we propose to separate the sources using a Speaker LOcalization Guided Deflation (SLOGD) approach wherein we estimate the sources iteratively. In each iteration we first estimate the location of the speaker and use it to estimate a mask corresponding to the localized speaker. The estimated source is removed from the mixture before estimating the location and mask of the next source. Experiments are conducted on a reverberated, noisy multichannel version of the well-studied WSJ-2MIX dataset using word error rate (WER) as a metric. The proposed method achieves a WER of 44.2%, a 34% relative improvement over the system without separation and 17% relative improvement over Conv-TasNet.
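The iterative procedure described in the abstract can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: `estimate_location` and `estimate_mask` are hypothetical callables standing in for the localization and mask estimation models, the array shapes are assumed, and removing a source by applying the complementary mask `(1 - mask)` is an assumption made here for concreteness.

```python
from typing import Callable, List, Tuple

import numpy as np


def deflate(
    mixture: np.ndarray,
    estimate_location: Callable[[np.ndarray], float],
    estimate_mask: Callable[[np.ndarray, float], np.ndarray],
    num_sources: int,
) -> Tuple[List[np.ndarray], np.ndarray]:
    """Estimate sources one at a time, removing each from the mixture.

    mixture: assumed shape (channels, frames, freqs).
    estimate_location: hypothetical model returning a location
        (e.g. a direction of arrival) for the dominant remaining speaker.
    estimate_mask: hypothetical model returning a (frames, freqs) mask
        in [0, 1] for the speaker at the given location.
    """
    residual = mixture.copy()
    sources = []
    for _ in range(num_sources):
        # Step 1: localize a speaker in what remains of the mixture.
        location = estimate_location(residual)
        # Step 2: estimate a mask for the localized speaker.
        mask = estimate_mask(residual, location)
        # Step 3: extract the source (mask broadcasts over channels).
        sources.append(mask * residual)
        # Step 4: remove it before the next iteration (assumed form).
        residual = (1.0 - mask) * residual
    return sources, residual
```

With this form of removal, the extracted sources and the final residual always sum back to the input mixture, which makes the loop easy to sanity-check with dummy models.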
