Heterogeneous Target Speech Separation
Efthymios Tzinis, Gordon Wichern, Aswin Subramanian, Paris Smaragdis, Jonathan Le Roux

TL;DR
This paper presents a heterogeneous target speech separation framework that leverages diverse datasets and concepts, improving generalization and robustness in single-channel source separation tasks.
Contribution
It introduces a novel heterogeneous separation paradigm that utilizes cross-domain concepts and datasets, enhancing generalization and robustness over traditional single-domain models.
Findings
Models trained with heterogeneous conditions outperform single-domain models.
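The approach improves generalization to new concepts on unseen out-of-domain data and can yield improvements over permutation invariant training with oracle source selection.
Heterogeneous training leads to more robust learning of new, harder discriminative concepts for source separation.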
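For readers unfamiliar with the baseline named above, the following is a minimal sketch of permutation invariant training (PIT) scoring, not the authors' code. It assumes plain SNR as the metric and uses hypothetical helper names (`snr_db`, `pit_best_permutation`). Every assignment of estimates to reference sources is scored and the best one is kept; "oracle source selection" then means picking the estimate matched to the desired source using the ground-truth references.

```python
import itertools
import numpy as np


def snr_db(estimate, target, eps=1e-8):
    """Signal-to-noise ratio in dB between an estimate and its reference."""
    noise = target - estimate
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))


def pit_best_permutation(estimates, targets):
    """Return (perm, score) where perm[i] is the estimate index assigned to target i."""
    best_perm, best_score = None, -np.inf
    for perm in itertools.permutations(range(len(targets))):
        score = float(np.mean([snr_db(estimates[p], targets[i])
                               for i, p in enumerate(perm)]))
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score


# Toy example: the model outputs the two sources in swapped order.
a = np.array([1.0, -1.0, 1.0, -1.0])
b = np.array([0.5, 0.5, -0.5, -0.5])
targets = [a, b]
estimates = [b + 0.01, a + 0.01]
perm, score = pit_best_permutation(estimates, targets)
```

In the toy example, `perm` maps target 0 to estimate 1 and target 1 to estimate 0, recovering the swap.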
Abstract
We introduce a new paradigm for single-channel target source separation where the sources of interest can be distinguished using non-mutually exclusive concepts (e.g., loudness, gender, language, spatial location, etc.). Our proposed heterogeneous separation framework can seamlessly leverage datasets with large distribution shifts and learn cross-domain representations under a variety of concepts used as conditioning. Our experiments show that training separation models with heterogeneous conditions facilitates the generalization to new concepts with unseen out-of-domain data while also performing substantially better than single-domain specialist models. Notably, such training leads to more robust learning of new harder source separation discriminative concepts and can yield improvements over permutation invariant training with oracle source selection. We analyze the intrinsic behavior…
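The idea of conditioning on non-mutually exclusive concepts can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's implementation: the `CONDITIONS` vocabulary, the one-hot encoding, and the helper names `condition_vector` and `select_target` are all hypothetical, and the sketch assumes each condition must identify exactly one source. The same mixture yields different targets depending on which concept is used as the query.

```python
import numpy as np

# Hypothetical vocabulary of (concept, value) queries; not taken from the paper.
CONDITIONS = [
    ("loudness", "louder"), ("loudness", "quieter"),
    ("gender", "female"), ("gender", "male"),
    ("language", "english"), ("language", "french"),
]


def condition_vector(concept, value):
    """One-hot encoding of a (concept, value) query, fed to the model as conditioning."""
    vec = np.zeros(len(CONDITIONS), dtype=np.float32)
    vec[CONDITIONS.index((concept, value))] = 1.0
    return vec


def select_target(sources, metadata, concept, value):
    """Pick the training target among `sources` that satisfies the query."""
    if concept == "loudness":
        energies = [float(np.sum(s ** 2)) for s in sources]
        idx = np.argmax(energies) if value == "louder" else np.argmin(energies)
        return sources[int(idx)]
    matches = [s for s, m in zip(sources, metadata) if m.get(concept) == value]
    if len(matches) != 1:
        # Assumption of this sketch: a usable query singles out one source.
        raise ValueError("condition does not identify a unique target")
    return matches[0]


# Toy mixture of two speakers with made-up metadata.
loud = np.ones(4)
quiet = 0.1 * np.ones(4)
sources = [loud, quiet]
metadata = [
    {"gender": "female", "language": "english"},
    {"gender": "male", "language": "english"},
]
mixture = loud + quiet

target_by_loudness = select_target(sources, metadata, "loudness", "quieter")
target_by_gender = select_target(sources, metadata, "gender", "female")
cond = condition_vector("gender", "female")
```

Here a language query would raise an error because both speakers share the same language, which shows why a concept is only useful as conditioning when it actually discriminates between the sources in a given mixture.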
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
