Speech Separation based on Contrastive Learning and Deep Modularization

Peter Ochieng

arXiv:2305.10652·cs.SD·October 10, 2024·1 cites

Open Access

TL;DR

This paper introduces an unsupervised speech separation method using contrastive learning and deep modularization, effectively handling multiple speakers without labeled data and maintaining performance as speaker count increases.

Contribution

It presents a novel unsupervised approach combining contrastive learning with deep modularization for speech separation, addressing permutation and data mismatch issues.

Findings

01

Achieves an SI-SNRi of 20.8 dB on WSJ0-2mix

02

Attains an SI-SNRi of 20.7 dB on WSJ0-3mix

03

Performance remains stable with increasing number of speakers
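
For readers unfamiliar with the metric in the findings above, SI-SNRi is the improvement in scale-invariant signal-to-noise ratio of the separated signal over the unprocessed mixture. The following is a minimal NumPy sketch of the standard definition, not the paper's evaluation code; the function names `si_snr` and `si_snr_improvement` and the synthetic white-noise signals are assumptions made for illustration.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals."""
    # Remove the mean so that a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target: this is the "signal" part.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    # Whatever is left after the projection is treated as noise.
    noise = estimate - projection
    ratio = (np.sum(projection ** 2) + eps) / (np.sum(noise ** 2) + eps)
    return 10.0 * np.log10(ratio)


def si_snr_improvement(estimate, target, mixture):
    """SI-SNRi: gain of the estimate over the raw mixture, in dB."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Hypothetical toy data: two white-noise "speakers" and an imperfect
# separation that still leaks 10% of the interfering speaker.
rng = np.random.default_rng(0)
speaker_1 = rng.standard_normal(16000)
speaker_2 = rng.standard_normal(16000)
mixture = speaker_1 + speaker_2
estimate = speaker_1 + 0.1 * speaker_2

improvement = si_snr_improvement(estimate, speaker_1, mixture)
print(f"SI-SNRi: {improvement:.2f} dB")
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.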

Abstract

Current state-of-the-art monaural speech separation tools rely on supervised learning. As a result, they must deal with the permutation problem and are affected by a mismatch between the number of speakers seen in training and at inference. Moreover, their performance depends heavily on the availability of high-quality labelled data. These problems can be effectively addressed by employing a fully unsupervised technique for speech separation. In this paper, we use contrastive learning to establish the representations of frames, then use the learned representations in the downstream deep modularization task. Concretely, we demonstrate experimentally that in speech separation, different frames of a speaker can be viewed as augmentations of a given hidden standard frame of that speaker. The frames of a speaker contain enough prosodic information overlap, which is key in speech separation.…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Contrastive Learning