Robust fine-tuning of speech recognition models via model merging: application to disordered speech
Alexandre Ducorroy, Rachid Riad

TL;DR
This paper applies model merging to improve speech recognition accuracy on disordered speech, reporting a 12% relative WER reduction over standard fine-tuning, with gains that hold in low-data scenarios.
Contribution
It presents a novel model merging approach for fine-tuning speech recognition models, improving generalization on dysarthric speech without extra inference costs.
Findings
12% relative WER reduction with multi-run merging
16.2% relative WER reduction on long-form audio
Effective across different model architectures
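The figures above are relative reductions, meaning the drop in WER is expressed as a fraction of the baseline WER rather than in absolute percentage points. A minimal sketch of that arithmetic follows; the function name and the WER values are hypothetical and chosen only for illustration, not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values (in percent), not results from the paper.
baseline = 10.0  # fine-tuned model without merging
merged = 8.8     # merged model
reduction = relative_wer_reduction(baseline, merged)
print(f"Relative WER reduction: {reduction:.1%}")
```

With these illustrative numbers, an absolute drop of 1.2 points corresponds to a 12% relative reduction.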
Abstract
Automatic Speech Recognition (ASR) has advanced with Speech Foundation Models (SFMs), yet performance degrades on dysarthric speech due to variability and limited data. This study, conducted as part of a submission to the Speech Accessibility challenge, explored model merging to improve ASR generalization using Whisper as the base SFM. We compared fine-tuning with single-trajectory merging, which combines models from one fine-tuning path, and multi-run merging, which merges independently trained models. Our best multi-run merging approach achieved a 12% relative decrease in WER over classic fine-tuning, and a 16.2% relative decrease on long-form audio, a major contributor to errors in dysarthric ASR. Merging more models led to continued gains, remained effective in low-data regimes, and generalized across model architectures. These results highlight model merging as an easily replicable adaptation…
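To make the two merging setups concrete, here is a minimal sketch of weight averaging, one common merging rule. The abstract does not state which rule the authors use, so uniform averaging is an assumption here. The function `merge_checkpoints`, the `StateDict` alias, and the toy parameter values are all hypothetical; real checkpoints would hold tensors rather than flat lists of floats.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical stand-in for a model checkpoint:
# parameter name -> flattened list of weights.
StateDict = Dict[str, List[float]]


def merge_checkpoints(
    checkpoints: Sequence[StateDict],
    weights: Optional[Sequence[float]] = None,
) -> StateDict:
    """Merge checkpoints by a weighted average of each parameter.

    With weights=None, every checkpoint contributes equally.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]  # normalize so weights sum to 1

    merged: StateDict = {}
    for name, reference in checkpoints[0].items():
        for ckpt in checkpoints:
            # All checkpoints must share the same architecture.
            if name not in ckpt or len(ckpt[name]) != len(reference):
                raise ValueError(f"parameter mismatch for {name!r}")
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(norm, checkpoints))
            for i in range(len(reference))
        ]
    return merged


# Toy checkpoints standing in for two fine-tuned models.
run_a = {"encoder.weight": [1.0, 2.0], "decoder.bias": [0.0]}
run_b = {"encoder.weight": [3.0, 4.0], "decoder.bias": [1.0]}
merged = merge_checkpoints([run_a, run_b])
print(merged)
```

Under this sketch, single-trajectory merging would pass checkpoints saved at different steps of one fine-tuning run, while multi-run merging would pass the final checkpoints of independently trained runs. In both cases the result is a single set of weights, so inference cost matches that of one model.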
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Balanced Selection
