Speech Separation based on Contrastive Learning and Deep Modularization
Peter Ochieng

TL;DR
This paper introduces an unsupervised speech separation method using contrastive learning and deep modularization, effectively handling multiple speakers without labeled data and maintaining performance as speaker count increases.
Contribution
It presents a novel unsupervised approach combining contrastive learning with deep modularization for speech separation, addressing permutation and data mismatch issues.
Findings
Achieves SI-SNRi of 20.8 dB on WSJ0-2mix
Attains SI-SNRi of 20.7 dB on WSJ0-3mix
Performance remains stable as the number of speakers increases
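For readers unfamiliar with the metric, SI-SNRi is the improvement in scale-invariant signal-to-noise ratio of the separated signal over the unprocessed mixture. The following is a minimal sketch of the commonly used definition, assuming NumPy and single-channel signals of equal length. The function names are illustrative and do not come from the paper, and the paper's own evaluation code may differ in detail.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference signal."""
    # Remove the mean so the measure ignores DC offsets.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to find the scaled reference.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target
    # Whatever is left after the projection counts as noise.
    e_noise = estimate - s_target
    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return 10.0 * np.log10(ratio)

def si_snr_improvement(estimate, target, mixture):
    """SI-SNRi: gain of the estimate over the raw mixture, in dB."""
    return si_snr(estimate, target) - si_snr(mixture, target)

# Hypothetical usage with synthetic signals (not real speech).
rng = np.random.default_rng(0)
target = rng.normal(size=16000)
interferer = rng.normal(size=16000)
mixture = target + interferer
estimate = target + 0.1 * interferer  # pretend the separator removed most interference
improvement = si_snr_improvement(estimate, target, mixture)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.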
Abstract
Current state-of-the-art monaural speech separation tools rely on supervised learning. As a result, they must deal with the permutation problem and are affected by a mismatch between the number of speakers used in training and at inference. Moreover, their performance depends heavily on the availability of high-quality labelled data. These problems can be effectively addressed by employing a fully unsupervised technique for speech separation. In this paper, we use contrastive learning to establish the representations of frames, then use the learned representations in the downstream deep modularization task. Concretely, we demonstrate experimentally that in speech separation, different frames of a speaker can be viewed as augmentations of a given hidden standard frame of that speaker. The frames of a speaker contain enough prosodic information overlap, which is key in speech separation.…
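The idea of treating frames of one speaker as augmentations of a shared hidden frame can be illustrated with a generic contrastive objective. The sketch below uses an InfoNCE-style loss in NumPy. It is an assumption for illustration only: the abstract does not state which loss the paper uses, and the function name, the temperature value, and the way positive pairs are formed (row i of each array taken from the same speaker) are all hypothetical.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss over frame embeddings.

    Row i of `anchors` and row i of `positives` are assumed to be embeddings
    of two frames from the same speaker; all other rows act as negatives.
    """
    # L2-normalise so the dot product is cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # shape (N, N)
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching pair sits on the diagonal.
    return -np.mean(np.diag(log_prob))

# Hypothetical usage: 8 "speakers", 16-dimensional embeddings.
rng = np.random.default_rng(0)
speaker_centres = rng.normal(size=(8, 16))
frames_a = speaker_centres + 0.01 * rng.normal(size=(8, 16))
frames_b = speaker_centres + 0.01 * rng.normal(size=(8, 16))
loss_matched = info_nce_loss(frames_a, frames_b)
loss_mismatched = info_nce_loss(frames_a, np.roll(frames_b, 1, axis=0))
```

When paired frames really do come from the same speaker, the loss is lower than when the pairing is shuffled, which is the signal a contrastive encoder would be trained on.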
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Contrastive Learning
