Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition

William Ravenscroft; George Close; Stefan Goetze; Thomas Hain; Mohammad Soleymanpour; Anurag Chowdhury; Mark C. Fuhs

arXiv:2406.08914·cs.SD·June 14, 2024·1 cites

Open Access

TL;DR

This paper introduces a transcription-free training approach for speech separation models that improves automatic speech recognition accuracy in noisy, reverberant multi-speaker scenarios. It uses embedding differences from pre-trained ASR encoders as a loss, together with a modified form of permutation invariant training (PIT) called guided PIT (GPIT).

Contribution

The paper presents a transcription-free method for jointly training the separation and ASR networks, using ASR-encoder embedding differences and guided PIT (GPIT), so that no reference transcriptions are needed during training.

Findings

01. 6.4% WER improvement over signal-level loss

02. Enhanced perceptual measures like STOI

03. Effective in noisy, reverberant multi-speaker environments
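For readers unfamiliar with the metric behind the first finding, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The sketch below is a generic illustration of that standard definition, not the paper's evaluation code; the function name `word_error_rate` and the whitespace tokenisation are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes whitespace tokenisation and a non-empty reference
    (an empty reference would divide by zero).
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word) # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For instance, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words.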

Abstract

One solution to automatic speech recognition (ASR) of overlapping speakers is to separate speech and then perform ASR on the separated signals. Commonly, the separator produces artefacts which often degrade ASR performance. Addressing this issue typically requires reference transcriptions to jointly train the separation and ASR networks. This is often not viable for training on real-world in-domain audio where reference transcript information is not always available. This paper proposes a transcription-free method for joint training using only audio signals. The proposed method uses embedding differences of pre-trained ASR encoders as a loss with a proposed modification to permutation invariant training (PIT) called guided PIT (GPIT). The method achieves a 6.4% improvement in word error rate (WER) measures over a signal-level loss and also shows enhancement improvements in perceptual…
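As a rough illustration of the ingredients named in the abstract, the sketch below combines standard permutation invariant training (PIT) with a loss computed on encoder embeddings rather than on waveforms. It does not implement the guided PIT (GPIT) modification, whose details are not given in this summary. Everything here is an assumption for illustration: `toy_encoder` is a hypothetical stand-in for a pre-trained ASR encoder, `embedding_distance` uses a mean absolute difference, and clean reference audio for each speaker is assumed to be available.

```python
from itertools import permutations


def toy_encoder(signal, frame=4):
    """Hypothetical stand-in for a pre-trained ASR encoder.

    Returns the mean energy of each non-overlapping frame. A real system
    would use the frozen encoder of an ASR model instead.
    Assumes len(signal) >= frame.
    """
    return [sum(x * x for x in signal[i:i + frame]) / frame
            for i in range(0, len(signal) - frame + 1, frame)]


def embedding_distance(estimate, reference, encoder=toy_encoder):
    """Mean absolute difference between the two signals' embeddings."""
    est_emb, ref_emb = encoder(estimate), encoder(reference)
    return sum(abs(a - b) for a, b in zip(est_emb, ref_emb)) / len(est_emb)


def pit_loss(estimates, references, pair_loss=embedding_distance):
    """Standard PIT: try every speaker assignment and keep the cheapest.

    Returns (loss, perm), where perm[i] is the index of the estimate
    matched to reference i.
    """
    n = len(references)
    best_loss, best_perm = None, None
    for perm in permutations(range(n)):
        loss = sum(pair_loss(estimates[p], references[i])
                   for i, p in enumerate(perm)) / n
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

Note that no transcript appears anywhere in the loss: only audio signals pass through the encoder, which is what makes this style of training transcription-free. The exhaustive search over permutations is only practical for a small number of speakers.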

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing