Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition
William Ravenscroft, George Close, Stefan Goetze, Thomas Hain, Mohammad Soleymanpour, Anurag Chowdhury, Mark C. Fuhs

TL;DR
This paper introduces a transcription-free training approach for speech separation models. It improves automatic speech recognition (ASR) accuracy in noisy, reverberant multi-speaker scenarios by using embedding differences from pre-trained ASR encoders as the loss, combined with a modified form of permutation invariant training (PIT).
Contribution
It presents a novel transcription-free joint training method using embedding differences and guided PIT, eliminating the need for reference transcriptions during training.
Findings
6.4% WER improvement over signal-level loss
Improved perceptual measures such as STOI
Effective in noisy, reverberant multi-speaker environments
Abstract
One solution to automatic speech recognition (ASR) of overlapping speakers is to separate speech and then perform ASR on the separated signals. Commonly, the separator produces artefacts which often degrade ASR performance. Addressing this issue typically requires reference transcriptions to jointly train the separation and ASR networks. This is often not viable for training on real-world in-domain audio where reference transcript information is not always available. This paper proposes a transcription-free method for joint training using only audio signals. The proposed method uses embedding differences of pre-trained ASR encoders as a loss with a proposed modification to permutation invariant training (PIT) called guided PIT (GPIT). The method achieves a 6.4% improvement in word error rate (WER) measures over a signal-level loss and also shows enhancement improvements in perceptual…
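The loss described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not define GPIT, so the idea that "guided" means choosing the speaker permutation with a signal-level loss and then scoring that permutation with the embedding loss is an assumption made here for illustration. `toy_encoder` is a hypothetical stand-in for a frozen pre-trained ASR encoder, and all function names are invented for this example.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def toy_encoder(signal, frame=2):
    """Hypothetical stand-in for a frozen ASR encoder: non-overlapping frame means."""
    return [sum(signal[i:i + frame]) / frame
            for i in range(0, len(signal) - frame + 1, frame)]


def embedding_loss(estimate, reference):
    """Distance between encoder embeddings of the estimate and the reference audio.

    Only audio is needed, so no reference transcription is involved.
    """
    return mse(toy_encoder(estimate), toy_encoder(reference))


def pit_loss(estimates, references, apply_loss, select_loss=None):
    """Permutation invariant loss over speaker assignments.

    With select_loss=None this is standard PIT: the permutation that minimises
    apply_loss is chosen. Passing a different select_loss (ASSUMPTION: one
    reading of "guided" PIT) picks the permutation with that loss instead and
    then reports apply_loss for the chosen permutation.
    """
    select_loss = select_loss or apply_loss
    n = len(references)

    def total(loss, perm):
        # perm[i] is the index of the estimate assigned to reference i.
        return sum(loss(estimates[p], references[i]) for i, p in enumerate(perm)) / n

    best = min(permutations(range(n)), key=lambda perm: total(select_loss, perm))
    return total(apply_loss, best), best


# Two toy "speakers"; the separator outputs them in swapped order.
references = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
estimates = [[0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 0.0]]

standard = pit_loss(estimates, references, embedding_loss)
guided = pit_loss(estimates, references, embedding_loss, select_loss=mse)
```

In both calls the swapped assignment is recovered and the loss is zero, because the toy estimates match the references exactly once reordered. In a real system the encoder would be a pre-trained network and the loss would be back-propagated through the separator only.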
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
