Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation

Xiaoyu Liu; Xu Li; Joan Serrà

arXiv:2210.12635·cs.SD·October 27, 2022·1 cites

Open Access

TL;DR

This paper investigates overlooked aspects of enrollment speaker embeddings in target speaker separation, highlighting the effectiveness of filterbank embeddings over self-supervised ones and questioning the suitability of speaker identification embeddings.

Contribution

The paper presents a comparative analysis of several enrollment embeddings for target speaker separation (TSS) and finds that filterbank embeddings generalize better across datasets than self-supervised embeddings.

Findings

01 Filterbank embeddings outperform self-supervised embeddings in cross-dataset tests.

02 Speaker identification embeddings may lose relevant information due to sub-optimal metrics or training objectives.

03 Filterbank embeddings consistently show competitive separation and generalization performance.

Abstract

Single channel target speaker separation (TSS) aims at extracting a speaker's voice from a mixture of multiple talkers given an enrollment utterance of that speaker. A typical deep learning TSS framework consists of an upstream model that obtains enrollment speaker embeddings and a downstream model that performs the separation conditioned on the embeddings. In this paper, we look into several important but overlooked aspects of the enrollment embeddings, including the suitability of the widely used speaker identification embeddings, the introduction of the log-mel filterbank and self-supervised embeddings, and the embeddings' cross-dataset generalization capability. Our results show that the speaker identification embeddings could lose relevant information due to a sub-optimal metric, training objective, or common pre-processing. In contrast, both the filterbank and the self-supervised…
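
As a rough illustration of the log-mel filterbank embeddings mentioned in the abstract, the sketch below computes log-mel features from an enrollment waveform with NumPy. It is not taken from the paper: the function names, the 16 kHz sample rate, the 512-point FFT, the 160-sample hop, the 40 mel bands, and the time-averaging step that turns the features into a fixed-size vector are all assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK-style mel scale formula.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters with centers spaced evenly on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel(signal, sample_rate=16000, n_fft=512, hop=160, n_mels=40):
    # Frame the signal, apply a Hann window, and take the power spectrum.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[t * hop : t * hop + n_fft] * window for t in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Project onto the mel filters and compress with a log.
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel + 1e-10)

def enrollment_embedding(signal, **kwargs):
    # Assumption for illustration only: average over time to get one
    # fixed-size vector. The paper may condition the separator differently.
    return log_mel(signal, **kwargs).mean(axis=0)
```

Under these assumptions, calling `enrollment_embedding` on a one-second, 16 kHz waveform yields a 40-dimensional vector that a downstream separation model could be conditioned on.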

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing