Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR

Weiqing Wang; Kunal Dhawan; Taejin Park; Krishna C. Puvvada; Ivan Medennikov; Somshubra Majumdar; He Huang; Jagadeesh Balam; Boris Ginsburg

arXiv:2409.01438·eess.AS·December 4, 2024

Open Access

TL;DR

This paper introduces methods to adapt speech foundation models for multi-speaker automatic speech recognition using limited data, achieving good generalization and improved performance with fewer parameters.

Contribution

We propose an approach for adapting speech foundation models to multi-speaker ASR with limited training data. Adapted using only telephonic data, the model also performs well on meeting data without any fine-tuning.

Findings

01

Fewer parameters lead to better cpWER performance

02

The adapted model generalizes well to meeting data

03

Counter-intuitive results highlight the importance of parameter efficiency

Abstract

Speech foundation models have achieved state-of-the-art (SoTA) performance across various tasks, such as automatic speech recognition (ASR) in hundreds of languages. However, multi-speaker ASR remains a challenging task for these models due to data scarcity and sparsity. In this paper, we present approaches to enable speech foundation models to process and understand multi-speaker speech with limited training data. Specifically, we adapt a speech foundation model for the multi-speaker ASR task using only telephonic data. Remarkably, the adapted model also performs well on meeting data without any fine-tuning, demonstrating the generalization ability of our approach. We conduct several ablation studies to analyze the impact of different parameters and strategies on model performance. Our findings highlight the effectiveness of our methods. Results show that fewer parameters give better…
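
The findings are reported in cpWER, which this page does not define. As commonly used in multi-speaker ASR evaluation, cpWER (concatenated minimum-permutation word error rate) concatenates each speaker's utterances, scores every assignment of hypothesis speakers to reference speakers, and keeps the assignment with the fewest word errors. The Python sketch below illustrates that standard definition only; it is not the authors' evaluation code, and the function names `edit_distance` and `cpwer` are hypothetical. It assumes each transcript is already concatenated per speaker and whitespace-tokenized, and it pads missing speakers with empty transcripts.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cpwer(refs, hyps):
    """Illustrative cpWER over per-speaker transcripts (lists of strings).

    Assumption: each string holds one speaker's utterances, already
    concatenated in time order. Brute force over permutations, so it is
    only practical for a small number of speakers.
    """
    ref_words = [r.split() for r in refs]
    hyp_words = [h.split() for h in hyps]
    # Pad the shorter side with empty transcripts so speaker counts match.
    n = max(len(ref_words), len(hyp_words))
    ref_words += [[]] * (n - len(ref_words))
    hyp_words += [[]] * (n - len(hyp_words))
    total_ref = sum(len(r) for r in ref_words)
    if total_ref == 0:
        raise ValueError("reference contains no words")
    # Keep the speaker assignment with the fewest total word errors.
    best_errors = min(
        sum(edit_distance(r, hyp_words[k]) for r, k in zip(ref_words, perm))
        for perm in permutations(range(n))
    )
    return best_errors / total_ref


if __name__ == "__main__":
    refs = ["hello world", "good morning everyone"]
    hyps = ["good morning everyone", "hello word"]  # speakers swapped, one error
    print(cpwer(refs, hyps))
```

Because the score is minimized over speaker assignments, a system is not penalized for labeling speakers in a different order than the reference, only for the words it attributes to the wrong speaker or gets wrong.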

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing