Cascaded encoders for fine-tuning ASR models on overlapped speech
Richard Rose, Oscar Chang, Olivier Siohan

TL;DR
This paper introduces a cascaded encoder approach combining foundation and multi-talker models to improve speech recognition accuracy on overlapping speech without losing performance on clean speech.
Contribution
The paper proposes a cascaded RNN-T encoder architecture that pairs a well-trained foundation model with a multi-talker mask model, improving recognition accuracy on overlapped speech.
Findings
Improved WER on overlapping speech utterances.
Maintains baseline performance on non-overlapping speech.
Cascaded model outperforms baseline multi-talker models.
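The findings above are stated in terms of word error rate (WER). As a reminder of what the metric measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the paper's own scoring setup, including how it scores multi-talker output, is not described on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper, not the paper's scoring tool. Tokenizes on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A lower value is better, so "improved WER" in the findings means a reduction in this ratio.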
Abstract
Multi-talker speech recognition (MT-ASR) has been shown to improve ASR performance on speech containing overlapping utterances from more than one speaker. Multi-talker models have typically been trained from scratch using simulated or actual overlapping speech datasets. On the other hand, the trend in ASR has been to train foundation models using massive datasets collected from a wide variety of task domains. Given the scale of these models and their ability to generalize well across a variety of domains, it makes sense to consider scenarios where a foundation model is augmented with multi-talker capability. This paper presents an MT-ASR model formed by combining a well-trained foundation model with a multi-talker mask model in a cascaded RNN-T encoder configuration. Experimental results show that the cascade configuration provides improved WER on overlapping speech utterances with…
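To make the idea of a cascaded encoder concrete, the following toy sketch chains a mask model with a stand-in for a pre-trained encoder. Everything in it is an assumption made for illustration: the function names, the single-layer "models", the feature dimensions, the element-wise sigmoid mask, and in particular the ordering (mask first, foundation encoder second). The abstract does not specify these details, so this should not be read as the authors' architecture.

```python
import numpy as np


def mask_encoder(features: np.ndarray, w_mask: np.ndarray) -> np.ndarray:
    """Hypothetical multi-talker mask model: a sigmoid mask per frame and feature bin."""
    return 1.0 / (1.0 + np.exp(-(features @ w_mask)))


def foundation_encoder(features: np.ndarray, w_enc: np.ndarray) -> np.ndarray:
    """Toy stand-in for a pre-trained foundation encoder (a single tanh layer)."""
    return np.tanh(features @ w_enc)


def cascaded_encoder(features: np.ndarray, w_mask: np.ndarray, w_enc: np.ndarray) -> np.ndarray:
    """Apply the mask to the input features, then encode the masked features.

    The mask-then-encode ordering is an assumption of this sketch.
    """
    mask = mask_encoder(features, w_mask)
    masked_features = mask * features
    return foundation_encoder(masked_features, w_enc)


# Illustrative sizes only: 50 frames of 80-dimensional features, 256-dimensional output.
rng = np.random.default_rng(0)
frames, feat_dim, enc_dim = 50, 80, 256
features = rng.standard_normal((frames, feat_dim))
w_mask = 0.1 * rng.standard_normal((feat_dim, feat_dim))  # would be trained on overlapped speech
w_enc = 0.1 * rng.standard_normal((feat_dim, enc_dim))    # would come from the pre-trained model
encoded = cascaded_encoder(features, w_mask, w_enc)
```

In a setup like this, one plausible design choice is to keep the foundation weights fixed and train only the mask model, which would be consistent with the stated goal of retaining baseline accuracy on non-overlapping speech. Whether the paper does so is not stated in the text available here.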
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
