TL;DR
This paper introduces a self-supervised, domain-invariant pretrained frontend for speech separation that reduces the domain gap between synthetic training data and real-world applications, improving separation quality.
Contribution
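A domain-invariant pretrained (DIP) frontend trained with two pretext tasks, mixture predictive coding (MPC) and mixture invariant coding (MIC), that captures contextual cues shared by real and synthetic mixtures, so that separation models trained on synthetic data transfer better to real recordings.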
Findings
The DIP frontend outperforms existing models on standard benchmarks.
Pretraining improves speech separation quality in real-world scenarios.
The approach effectively reduces domain mismatch in speech separation.
Abstract
Speech separation seeks to separate individual speech signals from a speech mixture. Typically, most separation models are trained on synthetic data due to the unavailability of target reference in real-world cocktail party scenarios. As a result, there exists a domain gap between real and synthetic data when deploying speech separation models in real-world applications. In this paper, we propose a self-supervised domain-invariant pretrained (DIP) frontend that is exposed to mixture data without the need for target reference speech. The DIP frontend utilizes a Siamese network with two innovative pretext tasks, mixture predictive coding (MPC) and mixture invariant coding (MIC), to capture shared contextual cues between real and synthetic unlabeled mixtures. Subsequently, we freeze the DIP frontend as a feature extractor when training the downstream speech separation models on synthetic…
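The abstract's point that target references exist only for synthetic data can be illustrated with a minimal sketch of how such a mixture might be built. This is not the paper's data pipeline: the function names `rms` and `mix_at_snr`, the SNR-based scaling, and the random noise used in place of clean utterances are all assumptions made for illustration.

```python
import math
import random


def rms(signal):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(source_a, source_b, snr_db):
    """Scale source_b so that source_a sits snr_db above it, then sum.

    Returns (mixture, reference_a, reference_b). The references are the
    exact signals that were added together, which is why a synthetic
    mixture comes with training targets and a real recording does not.
    """
    if len(source_a) != len(source_b):
        raise ValueError("sources must have the same length")
    gain = rms(source_a) / (rms(source_b) * 10 ** (snr_db / 20))
    scaled_b = [gain * x for x in source_b]
    mixture = [a + b for a, b in zip(source_a, scaled_b)]
    return mixture, list(source_a), scaled_b


# Hypothetical stand-ins for two clean utterances: 1 s of noise at 8 kHz.
rng = random.Random(0)
speaker_1 = [rng.gauss(0.0, 1.0) for _ in range(8000)]
speaker_2 = [rng.gauss(0.0, 0.5) for _ in range(8000)]

mixture, ref_1, ref_2 = mix_at_snr(speaker_1, speaker_2, snr_db=3.0)
```

A real cocktail-party recording supplies only the equivalent of `mixture`, with no `ref_1` or `ref_2`. That is the setting in which the abstract says the DIP frontend is pretrained, since it needs only mixtures.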
Taxonomy
Methods: Siamese Network · ALIGN
