Building Corpora for Single-Channel Speech Separation Across Multiple Domains
Matthew Maciejewski, Gregory Sell, Leibny Paola Garcia-Perera, Shinji Watanabe, Sanjeev Khudanpur

TL;DR
This paper develops a method to create realistic synthetic datasets for single-channel speech separation, highlighting the limitations of current models and emphasizing the importance of diverse training data for robustness across different scenarios.
Contribution
It introduces a procedure for building high-quality synthetic overlap datasets from existing corpora, improving the realism and diversity of training data for speech separation models.
Findings
Current models underperform on realistic datasets
Diverse training data improves model robustness
Synthetic datasets can better represent real-world conditions
Abstract
To date, the bulk of research on single-channel speech separation has been conducted using clean, near-field, read speech, which is not representative of many modern applications. In this work, we develop a procedure for constructing high-quality synthetic overlap datasets, necessary for most deep learning-based separation frameworks. We produce datasets that are more representative of realistic applications using the CHiME-5 and Mixer 6 corpora and evaluate standard methods on this data to demonstrate the shortcomings of current source-separation performance. We also demonstrate the value of a wide variety of data in training robust models that generalize well to multiple conditions.
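To make the idea of a synthetic overlap dataset concrete, the following is a minimal sketch of how one two-speaker mixture might be built from two single-speaker recordings. It is not the paper's procedure, which this summary does not detail. The function name `mix_at_sir`, the `offset` parameter, and the choice to set the level by a target signal-to-interference ratio are all assumptions made for illustration. The sketch keeps the scaled sources alongside the mixture because supervised separation training typically needs them as references.

```python
import numpy as np

def mix_at_sir(target, interferer, sir_db, offset=0):
    """Overlap `interferer` onto `target`, starting at sample `offset`.

    Hypothetical helper: the interferer is rescaled so that the
    target-to-interferer power ratio equals `sir_db` (in dB).
    Returns (mixture, target_source, interferer_source), all of equal length.
    """
    target = np.asarray(target, dtype=np.float64)
    interferer = np.asarray(interferer, dtype=np.float64)

    # Average power of each utterance, measured before any zero padding.
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interferer ** 2)
    if p_target == 0.0 or p_interf == 0.0:
        raise ValueError("both signals must have non-zero power")

    # Gain that brings the interferer to the requested relative level.
    gain = np.sqrt(p_target / (p_interf * 10.0 ** (sir_db / 10.0)))

    # Zero-pad both sources to a common length so they can be summed.
    length = max(len(target), offset + len(interferer))
    s1 = np.zeros(length)
    s2 = np.zeros(length)
    s1[: len(target)] = target
    s2[offset : offset + len(interferer)] = gain * interferer

    return s1 + s2, s1, s2
```

Because the mixture is the sample-wise sum of the two returned sources, the references stay exactly aligned with it. A realistic pipeline would also have to handle recording conditions such as far-field audio and background noise, which this sketch ignores.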
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Adaptive Filtering Techniques
