Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR
Weiqing Wang, Kunal Dhawan, Taejin Park, Krishna C. Puvvada, Ivan Medennikov, Somshubra Majumdar, He Huang, Jagadeesh Balam, Boris Ginsburg

TL;DR
This paper introduces methods to adapt speech foundation models for multi-speaker automatic speech recognition using limited data, achieving good generalization and improved performance with fewer parameters.
Contribution
We propose an approach for adapting speech foundation models to multi-speaker ASR with limited training data. Adapted on telephonic data only, the model also performs well on meeting data without any further fine-tuning.
Findings
Fewer parameters lead to better cpWER performance
The adapted model generalizes well to meeting data
Counter-intuitive results highlight the importance of parameter efficiency
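The findings above are reported in cpWER (concatenated minimum-permutation word error rate). As a rough illustration of what that metric measures, the sketch below concatenates each speaker's utterances, tries every mapping between reference and hypothesis speakers, and keeps the mapping with the fewest word errors. It is a minimal sketch and not the paper's evaluation code: the function names `edit_distance` and `cpwer`, the dictionary input format, and the example transcripts are all assumptions made for illustration.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cpwer(ref_by_speaker, hyp_by_speaker):
    """cpWER over dicts mapping a speaker label to a list of utterances.

    Hypothetical helper: brute-force search over speaker permutations,
    which is only practical for a small number of speakers.
    """
    # Concatenate every utterance of a speaker into one word sequence.
    refs = [" ".join(utts).split() for utts in ref_by_speaker.values()]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker.values()]

    # Pad with empty speakers so both sides have the same speaker count.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]

    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")

    # Keep the speaker assignment with the fewest total word errors.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words


# Made-up example: hypothesis speaker labels differ from the reference labels.
reference = {"A": ["hello there", "how are you"], "B": ["fine thanks"]}
hypothesis = {"spk0": ["fine thanks"], "spk1": ["hello there how are we"]}
score = cpwer(reference, hypothesis)
print(f"{score:.3f}")  # 0.143 (one substitution over seven reference words)
```

Because the score is taken over the best speaker permutation, a system is not penalized for using different speaker labels than the reference, only for wrong words and for attributing words to the wrong speaker.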
Abstract
Speech foundation models have achieved state-of-the-art (SoTA) performance across various tasks, such as automatic speech recognition (ASR) in hundreds of languages. However, multi-speaker ASR remains a challenging task for these models due to data scarcity and sparsity. In this paper, we present approaches to enable speech foundation models to process and understand multi-speaker speech with limited training data. Specifically, we adapt a speech foundation model for the multi-speaker ASR task using only telephonic data. Remarkably, the adapted model also performs well on meeting data without any fine-tuning, demonstrating the generalization ability of our approach. We conduct several ablation studies to analyze the impact of different parameters and strategies on model performance. Our findings highlight the effectiveness of our methods. Results show that fewer parameters give better…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
