Cascaded encoders for fine-tuning ASR models on overlapped speech

Richard Rose; Oscar Chang; Olivier Siohan

arXiv:2306.16398 · cs.SD · June 29, 2023

TL;DR

This paper introduces a cascaded encoder approach that combines a foundation model with a multi-talker mask model to improve speech recognition accuracy on overlapping speech without degrading performance on non-overlapping speech.

Contribution

The paper proposes a cascaded RNN-T encoder architecture that augments a well-trained foundation model with a multi-talker mask model, improving recognition of overlapped speech.

Findings

01

Improved WER on overlapping speech utterances.

02

Maintains baseline performance on non-overlapping speech.

03

Cascaded model outperforms baseline multi-talker models.
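
The findings above are stated in terms of word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. This is not the paper's scoring pipeline, which is not described on this page; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Illustrative only: whitespace tokenization and the function name are
    assumptions, not details taken from the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions, which is worth keeping in mind when comparing systems on overlapped speech.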

Abstract

Multi-talker speech recognition (MT-ASR) has been shown to improve ASR performance on speech containing overlapping utterances from more than one speaker. Multi-talker models have typically been trained from scratch using simulated or actual overlapping speech datasets. On the other hand, the trend in ASR has been to train foundation models using massive datasets collected from a wide variety of task domains. Given the scale of these models and their ability to generalize well across a variety of domains, it makes sense to consider scenarios where a foundation model is augmented with multi-talker capability. This paper presents an MT-ASR model formed by combining a well-trained foundation model with a multi-talker mask model in a cascaded RNN-T encoder configuration. Experimental results show that the cascade configuration provides improved WER on overlapping speech utterances with…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems