Speech separation with large-scale self-supervised learning
Zhuo Chen, Naoyuki Kanda, Jian Wu, Yu Wu, Xiaofei Wang, Takuya Yoshioka, Jinyu Li, Sunit Sivasankaran, Sefik Emre Eskimez

TL;DR
This paper demonstrates that large-scale self-supervised learning significantly improves speech separation and transcription accuracy, reducing word error rates on simulated and real meeting data while maintaining computational efficiency.
Contribution
It introduces a large-scale SSL-based speech separation model trained on over 300K hours of data, with techniques for efficient integration and fine-tuning that outperform supervised baselines.
Findings
Achieves up to 15.9% relative WER reduction on simulated data.
Reduces WER by up to 10.6% relative on real meeting recordings.
Reduces computational cost by 38% compared to baseline.
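The WER figures above are relative reductions, meaning the drop in WER is divided by the baseline WER rather than reported in absolute percentage points. A minimal sketch of that arithmetic follows. The function name `relative_reduction` and the two WER values are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Return the reduction of `proposed` relative to `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - proposed) / baseline


# Hypothetical WERs (in percent); illustrative values, not results from the paper.
baseline_wer = 20.0
proposed_wer = 16.82

# An absolute gap of 3.18 points on a 20.0 baseline is a 15.9% relative reduction.
print(f"{relative_reduction(baseline_wer, proposed_wer):.1f}% relative WER reduction")
```

The same formula applies to the computational cost figure, with cost substituted for WER.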
Abstract
Self-supervised learning (SSL) methods such as WavLM have shown promising speech separation (SS) results in small-scale simulation-based experiments. In this work, we extend the exploration of the SSL-based SS by massively scaling up both the pre-training data (more than 300K hours) and fine-tuning data (10K hours). We also investigate various techniques to efficiently integrate the pre-trained model with the SS network under a limited computation budget, including a low frame rate SSL model training setup and a fine-tuning scheme using only the part of the pre-trained model. Compared with a supervised baseline and the WavLM-based SS model using feature embeddings obtained with the previously released 94K hours trained WavLM, our proposed model obtains 15.9% and 11.2% of relative word error rate (WER) reductions, respectively, for a simulated far-field speech mixture test set. For…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Phonetics and Phonology Research
Methods: Test
