Heterogeneous Separation Consistency Training for Adaptation of Unsupervised Speech Separation
Jiangyu Han, Yanhua Long

TL;DR
This paper introduces separation consistency training (SCT), an unsupervised training method that uses heterogeneous separation models and real-world unlabeled mixtures to improve speech separation performance in real scenarios.
Contribution
The study proposes a heterogeneous separation consistency training framework that iteratively refines models using pseudo labels from real mixtures and cross-knowledge adaptation.
Findings
Improved separation accuracy on real-world speech mixtures.
Effective use of unlabeled data with pseudo labeling.
Slight performance gains from linear fusion of heterogeneous outputs.
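The summary above does not say how agreement between the heterogeneous models is measured or how their outputs are fused. The following is only a minimal sketch of one plausible reading, assuming two separators that each return an array of shape (num_sources, num_samples), agreement scored by scale-invariant SNR (SI-SNR), and a fixed fusion weight. The names si_snr, align, fuse, and alpha are hypothetical and do not come from the paper.

```python
import itertools
import numpy as np


def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR (dB) of `est` measured against `ref`."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference; the residual counts as noise.
    proj = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - proj
    return 10.0 * np.log10((np.dot(proj, proj) + eps) / (np.dot(noise, noise) + eps))


def align(out_a, out_b):
    """Reorder the sources of out_b to best match out_a.

    Returns the reordered out_b and the mean SI-SNR under that ordering,
    used here as an (assumed) agreement score between the two models.
    """
    n = out_a.shape[0]
    best_perm, best_score = None, -np.inf
    for perm in itertools.permutations(range(n)):
        score = np.mean([si_snr(out_b[p], out_a[i]) for i, p in enumerate(perm)])
        if score > best_score:
            best_perm, best_score = perm, score
    return out_b[list(best_perm)], best_score


def fuse(out_a, out_b, alpha=0.5):
    """Linearly fuse two models' outputs after resolving source order.

    alpha is a hypothetical fusion weight, not a value from the paper.
    """
    aligned_b, agreement = align(out_a, out_b)
    return alpha * out_a + (1.0 - alpha) * aligned_b, agreement
```

In a pseudo-labeling loop, a mixture might be kept for further training only when the agreement score exceeds some threshold. That selection rule is likewise an assumption of this sketch rather than a detail stated in the summary.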
Abstract
Recently, supervised speech separation has made great progress. However, limited by the nature of supervised training, most existing separation methods require ground-truth sources and are trained on synthetic datasets. This ground-truth reliance is problematic, because the ground-truth signals are usually unavailable in real conditions. Moreover, in many industry scenarios, the real acoustic characteristics deviate far from the ones in simulated datasets. Therefore, the performance usually degrades significantly when applying the supervised speech separation models to real applications. To address these problems, in this study, we propose a novel separation consistency training, termed SCT, to exploit the real-world unlabeled mixtures for improving cross-domain unsupervised speech separation in an iterative manner, by leveraging the complementary information obtained from…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
