Time-Domain Speech Extraction with Spatial Information and Multi Speaker Conditioning Mechanism

Jisi Zhang; Catalin Zorila; Rama Doddipatla; Jon Barker

arXiv:2102.03762·eess.AS·June 17, 2021

TL;DR

This paper introduces a multi-channel time-domain speech extraction system that uses spatial information and speaker embeddings to improve separation of multiple speakers in noisy, reverberant environments, enhancing speech recognition accuracy.

Contribution

It proposes a novel speaker conditioning mechanism with an additional speaker branch, enabling effective multi-speaker extraction without label permutation ambiguity.

Findings

01 Achieved a 9% relative improvement in source separation performance over a strong multi-channel baseline.

02 Increased speech recognition accuracy by more than 16% relative over the same baseline.

03 Demonstrated these gains on 2-channel WHAMR! data.

Abstract

In this paper, we present a novel multi-channel speech extraction system that simultaneously extracts multiple clean individual sources from a mixture in noisy and reverberant environments. The proposed method is built on an improved multi-channel time-domain speech separation network which employs speaker embeddings to identify and extract multiple targets without label permutation ambiguity. To efficiently convey speaker information to the extraction model, we propose a new speaker conditioning mechanism that adds a speaker branch for receiving external speaker embeddings. Experiments on 2-channel WHAMR! data show that the proposed system improves source separation performance by 9% relative over a strong multi-channel baseline and increases speech recognition accuracy by more than 16% relative over the same baseline.
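
The "label permutation ambiguity" mentioned in the abstract can be illustrated with a small sketch. A blind separation network may emit its sources in any order, so training typically searches over output-to-target assignments (permutation-invariant training). When each output is tied to a given speaker embedding, the order is fixed in advance and a plain fixed-order loss suffices. The code below is only an illustration of that idea, not the paper's training objective: mean squared error is an assumed stand-in for the actual loss, the signals are random placeholders, and the function names `pit_loss` and `fixed_order_loss` are hypothetical.

```python
from itertools import permutations

import numpy as np


def mse(a, b):
    """Mean squared error between two signals (assumed stand-in loss)."""
    return float(np.mean((a - b) ** 2))


def pit_loss(estimates, targets):
    """Permutation-invariant loss: try every assignment of outputs to
    targets and keep the lowest average error."""
    n = len(targets)
    return min(
        sum(mse(estimates[p[i]], targets[i]) for i in range(n)) / n
        for p in permutations(range(n))
    )


def fixed_order_loss(estimates, targets):
    """Fixed-order loss: output i is compared only with target i, which is
    valid when output i is conditioned on speaker i's embedding."""
    n = len(targets)
    return sum(mse(estimates[i], targets[i]) for i in range(n)) / n


# Two random placeholder "speakers" (not real speech).
rng = np.random.default_rng(0)
targets = rng.standard_normal((2, 16000))

# Perfect estimates, but emitted in swapped order, as a blind separator might.
swapped = targets[::-1]

pit_swapped = pit_loss(swapped, targets)            # order does not matter
fixed_swapped = fixed_order_loss(swapped, targets)  # penalised for the swap
fixed_aligned = fixed_order_loss(targets, targets)  # order fixed by conditioning

print(pit_swapped, fixed_swapped, fixed_aligned)
```

The permutation search grows factorially with the number of speakers, which is one reason a conditioning scheme that fixes the output order can be attractive.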
