MC-SpEx: Towards Effective Speaker Extraction with Multi-Scale Interfusion and Conditional Speaker Modulation
Jun Chen, Wei Rao, Zilin Wang, Jiuxin Lin, Yukai Ju, Shulin He, Yannan Wang, Zhiyong Wu

TL;DR
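This paper introduces MC-SpEx, a speaker extraction system that combines weight-share multi-scale fusers (ScaleFusers), a multi-scale interactive mask generator (ScaleInterMG), and conditional speaker modulation (ConSM) to make better use of multi-scale information and speaker embeddings. It reports state-of-the-art results on Libri2Mix.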
Contribution
The paper proposes a new speaker extraction model with multi-scale interfusion and conditional speaker modulation. It targets two weaknesses the authors identify in SpEx+: inadequate use of multi-scale information and inadequate use of the speaker embedding.
Findings
Achieves state-of-the-art performance on Libri2Mix
Effectively utilizes multi-scale information through ScaleFusers and ScaleInterMG
Fully exploits speaker embeddings with the ConSM module
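The weight-sharing idea behind ScaleFusers can be illustrated with a toy example. The sketch below is not the paper's architecture: it only shows what it means for one set of parameters to process features from several scales, so that every scale is mapped into the same feature space. The function names, the scale labels, the dimensions, and the plain linear projection are all assumptions made for illustration.

```python
import random


def make_shared_projection(in_dim, out_dim, seed=0):
    """Build ONE linear projection (hypothetical stand-in for a fuser)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def project(frame):
        # Plain matrix-vector product: out_dim values per input frame.
        return [sum(w * x for w, x in zip(row, frame)) for row in weights]

    return project


def fuse_all_scales(features_by_scale, project):
    """Apply the same projection (same weights) to every scale."""
    return {scale: [project(frame) for frame in frames]
            for scale, frames in features_by_scale.items()}


# Toy per-scale features: 2 frames per scale, 4 channels per frame.
# The labels "short", "middle", "long" are illustrative only.
features_by_scale = {
    "short": [[1.0, 0.0, 0.5, 0.2], [0.3, 0.1, 0.0, 0.9]],
    "middle": [[0.2, 0.4, 0.1, 0.0], [1.0, 0.0, 0.5, 0.2]],
    "long": [[0.0, 0.7, 0.3, 0.3], [0.5, 0.5, 0.5, 0.5]],
}

project = make_shared_projection(in_dim=4, out_dim=3)
fused = fuse_all_scales(features_by_scale, project)
```

Because the weights are shared, an identical input frame yields an identical output regardless of which scale it came from, which is one simple way to read "consistency of the model's feature space".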
Abstract
The previous SpEx+ has yielded outstanding performance in speaker extraction and attracted much attention. However, it still encounters inadequate utilization of multi-scale information and speaker embedding. To this end, this paper proposes a new effective speaker extraction system with multi-scale interfusion and conditional speaker modulation (ConSM), which is called MC-SpEx. First of all, we design the weight-share multi-scale fusers (ScaleFusers) for efficiently leveraging multi-scale information as well as ensuring consistency of the model's feature space. Then, to consider different scale information while generating masks, the multi-scale interactive mask generator (ScaleInterMG) is presented. Moreover, we introduce the ConSM module to fully exploit speaker embedding in the speech extractor. Experimental results on the Libri2Mix dataset demonstrate the effectiveness of our…
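The abstract does not say how ConSM is computed. One common way to condition a network on an embedding is feature-wise affine modulation (FiLM-style), where the embedding predicts a per-channel scale and shift. The sketch below shows that generic technique only, as an assumption about what "conditional modulation" could look like; the function names, parameter shapes, and numbers are hypothetical and are not taken from the paper.

```python
def linear(weights, bias, vector):
    """Affine map: weights @ vector + bias, using plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def modulate(features, speaker_embedding, scale_params, shift_params):
    """Scale and shift every frame with values predicted from the embedding."""
    gamma = linear(*scale_params, speaker_embedding)  # per-channel scale
    beta = linear(*shift_params, speaker_embedding)   # per-channel shift
    return [[g * x + b for g, x, b in zip(gamma, frame, beta)]
            for frame in features]


# Toy inputs: 2 frames with 2 channels each, and a 3-dim speaker embedding.
features = [[1.0, 2.0], [3.0, 4.0]]
speaker_embedding = [1.0, 0.0, -1.0]

# Hypothetical (weights, bias) pairs; in a real model these are learned.
scale_params = ([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [1.0, 1.0])
shift_params = ([[0.0, 1.0, 0.0], [0.5, 0.0, 0.0]], [0.0, 0.0])

modulated = modulate(features, speaker_embedding, scale_params, shift_params)
print(modulated)  # [[2.0, 0.5], [6.0, 0.5]]
```

Here the embedding yields a scale of `[2.0, 0.0]` and a shift of `[0.0, 0.5]`, so the first channel is amplified and the second is replaced by a constant. The point of the toy numbers is that the same features are transformed differently depending on the target speaker's embedding.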
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
