Target Speech Extraction: Independent Vector Extraction Guided by Supervised Speaker Identification

Jiri Malek; Jakub Jansky; Zbynek Koldovsky; Tomas Kounovsky; Jaroslav Cmejla; Jindrich Zdansky

arXiv:2111.03482·eess.AS·July 29, 2022

TL;DR

This paper introduces a robust method for extracting a target speaker from an audio mixture. It guides independent vector extraction with deep learning-based speaker identification and adds an iterative deflation step that retries the extraction when the wrong speaker is detected.

Contribution

The paper presents a guided independent vector extraction method with a built-in non-intrusive quality check and iterative deflation, which improves target speech extraction in complex acoustic environments.

Findings

01 Effective in challenging conditions such as reverberation and noise

02 Outperforms state-of-the-art blind and supervised methods

03 Reduces incorrect extractions through iterative deflation

Abstract

This manuscript proposes a novel robust procedure for the extraction of a speaker of interest (SOI) from a mixture of audio sources. The estimation of the SOI is performed via independent vector extraction (IVE). Since the blind IVE cannot distinguish the target source by itself, it is guided towards the SOI via frame-wise speaker identification based on deep learning. Still, an incorrect speaker can be extracted due to guidance failings, especially when processing challenging data. To identify such cases, we propose a criterion for non-intrusively assessing the estimated speaker. It utilizes the same model as the speaker identification, so no additional training is required. When incorrect extraction is detected, we propose a "deflation" step in which the incorrect source is subtracted from the mixture and, subsequently, another attempt to extract the SOI is performed. The process is…
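The extract, check, and deflate loop described in the abstract can be sketched in a few lines. This is a minimal illustration of the control flow only, not the authors' implementation: the names `extract_with_deflation`, `deflate`, `loudest_channel`, and `resembles_target` are hypothetical, the "extractor" is a crude stand-in for guided IVE, and the acceptance test compares against a clean reference signal, whereas the paper's criterion is non-intrusive and relies on the speaker identification model. The retry limit of 3 is also an assumption.

```python
import math
from typing import Callable, List, Optional, Tuple

Signal = List[float]
Mixture = List[Signal]  # one list of samples per channel


def dot(a: Signal, b: Signal) -> float:
    return sum(x * y for x, y in zip(a, b))


def deflate(mixture: Mixture, source: Signal) -> Mixture:
    """Subtract the least-squares projection of `source` from every channel."""
    energy = dot(source, source)
    if energy == 0.0:
        return [list(ch) for ch in mixture]
    deflated = []
    for ch in mixture:
        gain = dot(ch, source) / energy
        deflated.append([x - gain * s for x, s in zip(ch, source)])
    return deflated


def extract_with_deflation(
    mixture: Mixture,
    extract: Callable[[Mixture], Signal],  # placeholder for guided IVE
    accept: Callable[[Signal], bool],      # placeholder for the quality check
    max_attempts: int = 3,                 # assumed retry limit
) -> Tuple[Optional[Signal], int]:
    """Extract a source; if the check rejects it, deflate and try again."""
    residual = mixture
    for attempt in range(1, max_attempts + 1):
        estimate = extract(residual)
        if accept(estimate):
            return estimate, attempt
        residual = deflate(residual, estimate)
    return None, max_attempts


# Toy data: a loud interferer on channel 0, the quieter target on channel 1.
n = 200
target = [math.sin(2 * math.pi * 13 * k / n) for k in range(n)]
interferer = [math.sin(2 * math.pi * 5 * k / n) for k in range(n)]
mixture = [[3.0 * x for x in interferer], list(target)]


def loudest_channel(m: Mixture) -> Signal:
    # Crude stand-in extractor: returns the highest-energy channel.
    return max(m, key=lambda ch: dot(ch, ch))


def resembles_target(est: Signal) -> bool:
    # Toy check using a clean reference (the paper's criterion needs none).
    denom = math.sqrt(dot(est, est) * dot(target, target))
    return denom > 0.0 and abs(dot(est, target)) / denom > 0.9


estimate, attempts = extract_with_deflation(mixture, loudest_channel, resembles_target)
```

In this toy setup the first attempt picks the louder interferer, the check rejects it, and the second attempt on the deflated mixture returns the target. Passing the extractor and the check as callables keeps the loop independent of how either one is implemented.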

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing