Speech separation with large-scale self-supervised learning

Zhuo Chen; Naoyuki Kanda; Jian Wu; Yu Wu; Xiaofei Wang; Takuya Yoshioka; Jinyu Li; Sunit Sivasankaran; Sefik Emre Eskimez

arXiv:2211.05172 · eess.AS · November 29, 2022

TL;DR

This paper demonstrates that large-scale self-supervised learning significantly improves speech separation and transcription accuracy, reducing word error rates on simulated and real meeting data while maintaining computational efficiency.

Contribution

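It introduces a large-scale SSL-based speech separation model, pre-trained on more than 300K hours of data and fine-tuned on 10K hours, together with techniques for integrating the pre-trained model efficiently under a limited computation budget. The resulting model outperforms a supervised baseline.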

Findings

01 Achieves up to 15.9% relative WER reduction over a supervised baseline on a simulated far-field speech mixture test set.

02 Reduces WER by up to 10.6% relative on real meeting recordings.

03 Reduces computational cost by 38% compared with the baseline.

Abstract

Self-supervised learning (SSL) methods such as WavLM have shown promising speech separation (SS) results in small-scale simulation-based experiments. In this work, we extend the exploration of SSL-based SS by massively scaling up both the pre-training data (more than 300K hours) and the fine-tuning data (10K hours). We also investigate various techniques to efficiently integrate the pre-trained model with the SS network under a limited computation budget, including a low frame rate SSL model training setup and a fine-tuning scheme that uses only part of the pre-trained model. Compared with a supervised baseline and a WavLM-based SS model using feature embeddings obtained with the previously released WavLM trained on 94K hours, our proposed model obtains 15.9% and 11.2% relative word error rate (WER) reductions, respectively, on a simulated far-field speech mixture test set. For…
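The abstract reports its gains as relative WER reductions. As a reading aid, the sketch below shows how such a figure is conventionally computed: WER is the word-level edit distance divided by the number of reference words, and the relative reduction is the drop in WER as a percentage of the baseline WER. The function names and the example WER values (20.0 and 16.82) are hypothetical and chosen only to illustrate the arithmetic. They are not taken from the paper, and the paper's own scoring pipeline may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, proposed_wer: float) -> float:
    """Percentage by which proposed_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


# Hypothetical numbers, not results from the paper:
# a baseline at 20.0% WER and a proposed system at 16.82% WER.
reduction = relative_wer_reduction(20.0, 16.82)
print(f"relative WER reduction: {reduction:.1f}%")
```

A relative reduction is not an absolute one: in the hypothetical case above the WER falls by 3.18 percentage points, which is 15.9% of the baseline value.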

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Phonetics and Phonology Research
