SLOGD: Speaker LOcation Guided Deflation approach to speech separation

Sunit Sivasankaran; Emmanuel Vincent; Dominique Fohr

arXiv:1910.11131·eess.AS·October 25, 2019·ICASSP

TL;DR

This paper introduces SLOGD, an iterative speech separation method that uses each speaker's estimated location to guide the estimation of that speaker's mask. On a reverberated, noisy multichannel version of WSJ-2MIX it reaches a word error rate (WER) of 44.2%.

Contribution

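The paper presents a speaker localization guided deflation approach to speech separation: sources are estimated one at a time, and each estimated source is removed from the mixture before the next one is localized. It reports a 17% relative WER improvement over Conv-TasNet.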

Findings

01

Achieves a 44.2% WER on a reverberated, noisy multichannel version of the WSJ-2MIX dataset.

02

Provides a 34% relative WER reduction over the system without separation.

03

Outperforms Conv-TasNet by 17% relative WER.
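
The relative figures above follow the usual definition, (baseline WER - new WER) / baseline WER. The short sketch below shows that arithmetic and how a baseline WER can be backed out from a reported relative improvement. The function names are hypothetical and chosen only for illustration. Because the reported percentages are rounded, any baseline recovered this way is approximate and is not a figure quoted from the paper.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of new_wer with respect to baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline_wer(new_wer: float, rel_improvement: float) -> float:
    """Baseline WER implied by a new WER and a relative improvement.

    Inverts relative_improvement(); rel_improvement is a fraction (0.34 for 34%).
    """
    return new_wer / (1.0 - rel_improvement)


# Approximate baselines implied by the rounded figures reported above.
no_separation_wer = implied_baseline_wer(44.2, 0.34)
conv_tasnet_wer = implied_baseline_wer(44.2, 0.17)
```

Calling `relative_improvement(no_separation_wer, 44.2)` returns the 0.34 that was put in, which is a quick check that the two functions are inverses of each other.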

Abstract

Speech separation is the process of separating multiple speakers from an audio recording. In this work we propose to separate the sources using a Speaker LOcalization Guided Deflation (SLOGD) approach wherein we estimate the sources iteratively. In each iteration we first estimate the location of the speaker and use it to estimate a mask corresponding to the localized speaker. The estimated source is removed from the mixture before estimating the location and mask of the next source. Experiments are conducted on a reverberated, noisy multichannel version of the well-studied WSJ-2MIX dataset using word error rate (WER) as a metric. The proposed method achieves a WER of 44.2%, a 34% relative improvement over the system without separation and a 17% relative improvement over Conv-TasNet.
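
The iterative loop described in the abstract (localize, estimate a mask, remove the source, repeat) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The callables `localize` and `estimate_mask` are hypothetical stand-ins for the paper's localization and mask estimation models. Removing a source by multiplying the residual with `1 - mask` is also an assumption made here, since the abstract does not say how removal is carried out.

```python
import numpy as np


def deflation_separate(mixture, n_sources, localize, estimate_mask):
    """Estimate sources one at a time from a multichannel STFT mixture.

    mixture:       complex array of shape (channels, frames, freqs).
    localize:      hypothetical callable, residual -> location estimate.
    estimate_mask: hypothetical callable, (residual, location) -> real mask
                   of shape (frames, freqs) with values in [0, 1].
    Returns a list of (location, masked reference-channel STFT) pairs and
    the residual left after all sources have been removed.
    """
    residual = mixture.copy()
    estimates = []
    for _ in range(n_sources):
        # Step 1: estimate the location of the next speaker.
        location = localize(residual)
        # Step 2: estimate a mask for the speaker at that location.
        mask = np.clip(estimate_mask(residual, location), 0.0, 1.0)
        # Apply the mask to channel 0, used here as the reference channel.
        estimates.append((location, mask * residual[0]))
        # Step 3 (assumed form of removal): suppress the masked
        # time-frequency bins in every channel before the next iteration.
        residual = residual * (1.0 - mask)
    return estimates, residual
```

With two speakers, `deflation_separate(mixture, 2, localize, estimate_mask)` runs the loop twice, and the second pass sees a residual in which the first speaker's time-frequency bins have been attenuated.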
