Target Confusion in End-to-end Speaker Extraction: Analysis and Approaches

Zifeng Zhao; Dongchao Yang; Rongzhi Gu; Haoran Zhang; Yuexian Zou

arXiv:2204.01355 · eess.AS · April 5, 2022

TL;DR

This paper analyzes the target confusion problem in end-to-end speaker extraction and proposes methods that make speaker embeddings more distinguishable and correct extraction errors, yielding an SI-SDRi improvement of over 1 dB.

Contribution

It introduces a two-stage approach with metric learning for training and a post-filtering strategy for inference to address target confusion in speaker extraction.

Findings

01. Over 1 dB SI-SDRi improvement achieved

02. Enhanced distinguishability of speaker embeddings

03. Effective correction of extraction errors
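
For readers unfamiliar with the metric in the first finding, SI-SDRi is the scale-invariant signal-to-distortion ratio of the extracted signal minus that of the unprocessed mixture, both measured against the clean target. The following is a minimal pure-Python sketch of the standard definition, not the paper's evaluation code; the function names and the `eps` stabilizer are assumptions made for illustration.

```python
import math


def _zero_mean(signal):
    """Remove the DC offset, as is customary before computing SI-SDR."""
    mean = sum(signal) / len(signal)
    return [v - mean for v in signal]


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB of `estimate` against `reference`.

    `eps` is an illustrative stabilizer (an assumption), added to avoid
    division by zero and log of zero.
    """
    est = _zero_mean(estimate)
    ref = _zero_mean(reference)
    # Project the estimate onto the reference to get the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Everything not explained by the scaled target counts as distortion.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the extracted signal over the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate by a constant leaves the score essentially unchanged, which is why the metric suits systems that do not preserve output gain.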

Abstract

Recently, end-to-end speaker extraction has attracted increasing attention and shown promising results. However, its performance is often inferior to that of a blind source separation (BSS) counterpart with a similar network architecture, because the auxiliary speaker encoder may sometimes generate ambiguous speaker embeddings. Such ambiguous guidance may confuse the separation network and lead to wrong extraction results, which degrades overall performance. We refer to this as the target confusion problem. In this paper, we analyze this issue and address it in two stages. In the training phase, we propose to integrate metric learning methods to improve the distinguishability of embeddings produced by the speaker encoder. For inference, a novel post-filtering strategy is designed to revise wrong results. Specifically, we first identify…
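
The abstract does not say here which metric learning objectives are used, so the following is only a generic illustration of the idea: a triplet loss on cosine distance, one common metric-learning objective, which pushes embeddings of the same speaker together and embeddings of different speakers apart. The choice of triplet loss, the function names, and the `margin` value are all assumptions and should not be read as the authors' method.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hypothetical triplet loss on cosine distance.

    anchor / positive: embeddings of the same speaker.
    negative: embedding of a different speaker.
    margin: illustrative value (an assumption), not taken from the paper.
    """
    d_pos = 1.0 - cosine_similarity(anchor, positive)
    d_neg = 1.0 - cosine_similarity(anchor, negative)
    # Zero loss once the negative is farther than the positive by `margin`.
    return max(0.0, d_pos - d_neg + margin)
```

A loss of zero means the anchor already sits closer to its own speaker than to the other speaker by at least the margin; an ambiguous embedding of the kind the abstract describes would give a positive loss and therefore a training signal.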

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing