Self-Supervised Learning-Based Source Separation for Meeting Data
Yuang Li, Xianrui Zheng, Philip C. Woodland

TL;DR
This paper evaluates self-supervised learning models for source separation in meeting scenarios, proposing a novel integration method with ASR and demonstrating improved transcription accuracy on real-world data.
Contribution
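It compares seven SSL models on real and simulated data, introduces an iterative source selection method for integrating separation into a transcription system, and adapts time-domain unsupervised mixture invariant training to the time-frequency domain for better real-world performance.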
Findings
Reduced cpWER-us by 1.9% absolute on the AMI dev set
Reduced cpWER-us by 1.5% absolute on the AMI test set
Demonstrated effectiveness of the proposed source separation approach
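The findings above are reported in a variant of the concatenated minimum-permutation word error rate (cpWER). As a rough illustration of the base metric only, here is a minimal Python sketch: each speaker's utterances are concatenated, every mapping between reference and hypothesis speakers is tried, and the mapping with the fewest word errors is kept. The function names `edit_distance` and `cpwer` are illustrative assumptions, not the authors' scoring code, and the sketch does not model whatever the "-us" suffix denotes.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cpwer(refs, hyps):
    """Illustrative cpWER.

    refs, hyps: one list of utterance strings per speaker (or output stream).
    """
    # Concatenate each speaker's utterances into a single word sequence.
    ref_words = [" ".join(utts).split() for utts in refs]
    hyp_words = [" ".join(utts).split() for utts in hyps]
    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(ref_words), len(hyp_words))
    ref_words += [[] for _ in range(n - len(ref_words))]
    hyp_words += [[] for _ in range(n - len(hyp_words))]
    total_ref = sum(len(r) for r in ref_words)
    # Brute force over speaker permutations; fine for a handful of speakers.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(ref_words, perm))
        for perm in permutations(hyp_words)
    )
    return best_errors / total_ref


refs = [["hello world", "how are you"], ["good morning"]]
hyps = [["good morning"], ["hello world", "how are we"]]
score = cpwer(refs, hyps)  # one substitution over seven reference words
```

The exhaustive permutation search grows factorially with the number of speakers, so practical scoring tools typically replace it with an assignment solver; the brute-force form is kept here because it mirrors the definition directly.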
Abstract
Source separation can improve automatic speech recognition (ASR) in multi-party meeting scenarios by extracting single-speaker signals from overlapped speech. Despite the success of self-supervised learning (SSL) models in single-channel source separation, most studies have focused on simulated setups. In this paper, seven SSL models were compared on both simulated and real-world corpora. We then propose to integrate the best-performing model, WavLM, into an automatic transcription system through a novel iterative source selection method. To improve real-world performance, time-domain unsupervised mixture invariant training was adapted to the time-frequency domain. Experiments showed that, when source separation was inserted into the transcription system before an ASR model fine-tuned on separated speech, absolute reductions of 1.9% and 1.5% in concatenated minimum-permutation word error rate…
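For readers unfamiliar with mixture invariant training (MixIT), the idea can be sketched as follows: two mixtures are summed, the separator estimates several sources from the sum, and the loss is taken under the best assignment of each estimated source to one of the two original mixtures, so no clean reference sources are needed. The Python sketch below is a generic illustration and not the paper's time-frequency adaptation; the function names, the negative-SNR loss, and the brute-force search over assignments are assumptions made for this example.

```python
from itertools import product

import numpy as np


def neg_snr(target, estimate, eps=1e-8):
    """Negative signal-to-noise ratio in dB (lower is better)."""
    noise = target - estimate
    return -10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))


def mixit_loss(mix1, mix2, est_sources):
    """Illustrative MixIT loss.

    mix1, mix2:   the two reference mixtures, each of shape (T,)
    est_sources:  (M, T) sources estimated from mix1 + mix2
    Returns the best loss and the assignment that achieved it, where
    assignment[m] is 0 if source m is remixed into mix1 and 1 for mix2.
    """
    num_sources = est_sources.shape[0]
    best_loss, best_assign = np.inf, None
    # Try every binary assignment of estimated sources to the two mixtures.
    for assign in product([0, 1], repeat=num_sources):
        mask = np.array(assign, dtype=bool)
        remix1 = est_sources[~mask].sum(axis=0)  # sources assigned to mix1
        remix2 = est_sources[mask].sum(axis=0)   # sources assigned to mix2
        loss = neg_snr(mix1, remix1) + neg_snr(mix2, remix2)
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign


rng = np.random.default_rng(0)
sources = rng.standard_normal((4, 100))
mix1 = sources[0] + sources[2]
mix2 = sources[1] + sources[3]
# With perfect estimates, the search recovers which sources formed each mixture.
loss, assignment = mixit_loss(mix1, mix2, sources)
```

Because the loss only compares remixed estimates against the input mixtures, the same search could in principle be run on time-frequency representations instead of waveforms; how the paper actually does this is not specified in the abstract.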
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Test
