Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction

Zifeng Zhao; Rongzhi Gu; Dongchao Yang; Jinchuan Tian; Yuexian Zou

arXiv:2204.07375·eess.AS·April 18, 2022

Open Access

TL;DR

This paper introduces SAMoM, a weakly supervised training method for speaker extraction that leverages speaker identity consistency in mixed audio, enabling effective extraction without relying on clean sources and outperforming supervised methods in some scenarios.

Contribution

The paper proposes SAMoM, a novel weakly supervised training approach using mixture of mixtures and speaker identity consistency for speaker extraction.

Findings

01. Achieves 11.06 dB SI-SDRi without using clean sources for training.

02. Outperforms supervised methods in cross-domain evaluation.

03. Is effective in noisy scenarios when used in a semi-supervised setting.
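
The SI-SDRi figure above is an improvement in scale-invariant signal-to-distortion ratio: the SI-SDR of the extracted signal minus the SI-SDR of the unprocessed mixture, both measured against the clean target. A minimal NumPy sketch of the standard definition follows; the function names `si_sdr` and `si_sdr_improvement` are illustrative and are not taken from the paper.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals."""
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target part.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    # Everything left over is counted as distortion.
    noise = estimate - target
    return 10.0 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    )


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.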

Abstract

Most existing work adopts supervised training for speaker extraction, while the scarcity of ideally clean corpora and the channel mismatch problem are rarely considered. To this end, we propose speaker-aware mixture of mixtures training (SAMoM), which uses the consistency of speaker identity among the target source, the enrollment utterance, and the target estimate to weakly supervise the training of a deep speaker extractor. In SAMoM, the input is constructed by mixing different speaker-aware mixtures (SAMs), each containing multiple speakers whose identities are known and whose enrollment utterances are available. Informed by the enrollment utterances, the target speech signals are extracted from the input one by one, such that the estimated targets can approximate the original SAMs after being remixed in accordance with the identity consistency. Moreover, using SAMoM in a semi-supervised setting with a certain amount of clean…
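
The remixing step described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `extractor` is a hypothetical callable standing in for the deep speaker extractor, `samom_loss` and `snr_db` are made-up names, and the negative-SNR reconstruction loss is an assumed choice, since the abstract does not say which loss is used.

```python
import numpy as np


def snr_db(estimate, reference, eps=1e-8):
    """Plain signal-to-noise ratio in dB (assumed reconstruction measure)."""
    err = reference - estimate
    return 10.0 * np.log10(
        (np.dot(reference, reference) + eps) / (np.dot(err, err) + eps)
    )


def samom_loss(sams, sam_speakers, enrollments, extractor):
    """Weakly supervised loss that needs only the SAMs, not clean sources.

    sams         : list of 1-D arrays, each a speaker-aware mixture (SAM)
    sam_speakers : list of speaker-id lists, one list per SAM
    enrollments  : dict mapping speaker id -> enrollment utterance
    extractor    : hypothetical callable (mixture, enrollment) -> estimate
    """
    # The network input is the mixture of mixtures: all SAMs summed.
    mom = np.sum(sams, axis=0)
    loss = 0.0
    for sam, speakers in zip(sams, sam_speakers):
        # Extract each known speaker one by one, guided by their enrollment,
        # then remix the estimates that belong to the same SAM.
        remix = np.zeros_like(mom)
        for spk in speakers:
            remix = remix + extractor(mom, enrollments[spk])
        # The remix should approximate the original SAM.
        loss -= snr_db(remix, sam)
    return loss / len(sams)
```

Unlike mixture-invariant training, no search over source-to-mixture assignments appears in this sketch, because the known speaker identities already say which estimates belong to which SAM.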

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing