MC-SpEx: Towards Effective Speaker Extraction with Multi-Scale Interfusion and Conditional Speaker Modulation

Jun Chen; Wei Rao; Zilin Wang; Jiuxin Lin; Yukai Ju; Shulin He; Yannan Wang; Zhiyong Wu

arXiv:2306.16250 · cs.SD · June 29, 2023 · 1 cites

TL;DR

This paper introduces MC-SpEx, a speaker extraction system that makes fuller use of multi-scale information and speaker embeddings through its ScaleFusers, ScaleInterMG, and ConSM modules, achieving state-of-the-art results on Libri2Mix.

Contribution

The paper proposes a new speaker extraction model with multi-scale interfusion and conditional speaker modulation, improving upon prior methods by better leveraging multi-scale features and speaker embeddings.

Findings

01. Achieves state-of-the-art performance on Libri2Mix

02. Effectively utilizes multi-scale information through ScaleFusers and ScaleInterMG

03. Fully exploits speaker embeddings with ConSM module

Abstract

The earlier SpEx+ system has yielded outstanding performance in speaker extraction and attracted much attention. However, it still makes inadequate use of multi-scale information and of the speaker embedding. To this end, this paper proposes a new, effective speaker extraction system with multi-scale interfusion and conditional speaker modulation (ConSM), called MC-SpEx. First, we design the weight-share multi-scale fusers (ScaleFusers) to leverage multi-scale information efficiently while keeping the model's feature space consistent. Then, to take information at different scales into account while generating masks, we present the multi-scale interactive mask generator (ScaleInterMG). Moreover, we introduce the ConSM module to fully exploit the speaker embedding in the speech extractor. Experimental results on the Libri2Mix dataset demonstrate the effectiveness of our…
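
The abstract does not say how ConSM conditions the speech extractor on the speaker embedding. As a rough illustration of the general idea of conditional modulation, the sketch below applies a generic feature-wise scale and shift computed from a speaker embedding. It is an assumption for illustration only, not the paper's ConSM design; the names `linear` and `modulate`, the weight matrices, and the toy values are all hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Affine map: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def modulate(features: Matrix, speaker_emb: Vector,
             w_scale: Matrix, b_scale: Vector,
             w_shift: Matrix, b_shift: Vector) -> Matrix:
    """Scale and shift each frame's channels using the speaker embedding.

    `features` is laid out as frames x channels. The same per-channel
    scale and shift are applied to every frame (a hypothetical choice,
    not taken from the paper).
    """
    scale = linear(w_scale, b_scale, speaker_emb)
    shift = linear(w_shift, b_shift, speaker_emb)
    return [[s * f + h for f, s, h in zip(frame, scale, shift)]
            for frame in features]


# Toy example: 2 frames, 2 channels, 3-dimensional speaker embedding.
features = [[1.0, 2.0], [3.0, 4.0]]
speaker_emb = [1.0, 0.0, -1.0]
w_scale = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
b_scale = [1.0, 1.0]   # resulting scale: [2.0, 0.0]
w_shift = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
b_shift = [0.5, 0.0]   # resulting shift: [0.5, 1.0]

modulated = modulate(features, speaker_emb, w_scale, b_scale, w_shift, b_shift)
print(modulated)
```

In this toy setup the second channel receives a scale of zero, so its output depends only on the speaker-derived shift. That shows how an embedding could, in principle, suppress or emphasize individual channels.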

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing