Target Speech Extraction: Independent Vector Extraction Guided by Supervised Speaker Identification
Jiri Malek, Jakub Jansky, Zbynek Koldovsky, Tomas Kounovsky, Jaroslav Cmejla, Jindrich Zdansky

TL;DR
This paper introduces a robust method for extracting a target speaker from audio mixtures. Independent vector extraction is guided towards the target by deep learning-based speaker identification, and an iterative deflation process removes incorrectly extracted sources and retries the extraction in challenging scenarios.
Contribution
It presents a novel guided independent vector extraction method with an intrinsic non-intrusive quality check and iterative deflation, enhancing target speech extraction in complex acoustic environments.
Findings
Effective in challenging conditions like reverberation and noise
Outperforms state-of-the-art blind and supervised methods
Reduces incorrect extractions through iterative deflation
Abstract
This manuscript proposes a novel robust procedure for the extraction of a speaker of interest (SOI) from a mixture of audio sources. The estimation of the SOI is performed via independent vector extraction (IVE). Since the blind IVE cannot distinguish the target source by itself, it is guided towards the SOI via frame-wise speaker identification based on deep learning. Still, an incorrect speaker can be extracted due to guidance failings, especially when processing challenging data. To identify such cases, we propose a criterion for non-intrusively assessing the estimated speaker. It utilizes the same model as the speaker identification, so no additional training is required. When incorrect extraction is detected, we propose a "deflation" step in which the incorrect source is subtracted from the mixture and, subsequently, another attempt to extract the SOI is performed. The process is…
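The extract, check, deflate, and retry loop described in the abstract can be sketched roughly as follows. This is a minimal illustration of the control flow only, not the authors' implementation: `extract` and `is_target` are hypothetical callables standing in for the guided IVE and the speaker-identification-based quality check, `max_attempts` is an assumed parameter, and the least-squares subtraction in `deflate` is an assumed simplification of the paper's deflation step.

```python
import numpy as np


def deflate(mixture, source):
    """Subtract `source` from every channel of `mixture`.

    Assumed simplification: each channel's gain is found by least-squares
    projection onto the estimated source (not the paper's exact procedure).
    mixture has shape (channels, samples); source has shape (samples,).
    """
    energy = source @ source
    if energy == 0.0:
        return mixture  # nothing to subtract
    gains = mixture @ source / energy
    return mixture - np.outer(gains, source)


def extract_with_deflation(mixture, extract, is_target, max_attempts=3):
    """Try to extract the target; on a failed check, deflate and retry.

    `extract` (hypothetical) maps a mixture to one source estimate.
    `is_target` (hypothetical) returns True if the estimate is the SOI.
    Returns (estimate, number_of_deflations) or (None, max_attempts).
    """
    residual = mixture
    for attempt in range(max_attempts):
        estimate = extract(residual)
        if is_target(estimate):
            return estimate, attempt
        residual = deflate(residual, estimate)
    return None, max_attempts


# Toy data: two orthogonal "sources" observed on two channels.
s_int = np.array([1.0, 1.0, 1.0, 1.0])    # interfering source
s_tgt = np.array([1.0, -1.0, 1.0, -1.0])  # target source
toy_mixture = np.stack([3.0 * s_int, s_int + s_tgt])


def toy_extract(x):
    # Stand-in extractor: returns the highest-energy channel.
    return x[np.argmax((x ** 2).sum(axis=1))]


def toy_is_target(est):
    # Stand-in check: normalized correlation with a known reference.
    denom = np.linalg.norm(est) * np.linalg.norm(s_tgt) + 1e-12
    return abs(est @ s_tgt) / denom > 0.9


toy_estimate, toy_deflations = extract_with_deflation(
    toy_mixture, toy_extract, toy_is_target
)
```

In this toy setup the first attempt picks up the interferer, the check rejects it, and the target is recovered after one deflation. The toy check compares against a known reference signal purely for illustration, whereas the criterion described in the abstract is non-intrusive.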
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
