Robust fine-tuning of speech recognition models via model merging: application to disordered speech

Alexandre Ducorroy; Rachid Riad

arXiv:2505.20477·eess.AS·May 28, 2025

TL;DR

This paper applies model merging to the fine-tuning of Whisper for dysarthric speech recognition. Its best multi-run merging setup achieves a 12% relative WER reduction over classic fine-tuning, and the gains persist in low-data regimes.

Contribution

The paper compares single-trajectory and multi-run model merging as alternatives to classic fine-tuning of speech recognition models, improving generalization on dysarthric speech without extra inference cost.

Findings

01 12% relative WER reduction over classic fine-tuning with multi-run merging

02 16.2% relative WER reduction on long-form audio

03 Effective across different model architectures
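
The reductions above are relative, meaning they are expressed as a fraction of the baseline WER rather than as a difference in absolute WER points. A minimal sketch of that arithmetic follows; the function name and the WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration only (not results from the paper).
baseline = 0.25  # WER of a classically fine-tuned model
merged = 0.22    # WER of a merged model
reduction = relative_wer_reduction(baseline, merged)
print(f"relative WER reduction: {reduction:.1%}")
```

With these made-up values, an absolute drop of 3 WER points corresponds to a 12% relative reduction, which is why the two kinds of figure should not be compared directly.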

Abstract

Automatic Speech Recognition (ASR) has advanced with Speech Foundation Models (SFMs), yet performance degrades on dysarthric speech due to variability and limited data. This study, part of a submission to the Speech Accessibility challenge, explored model merging to improve ASR generalization using Whisper as the base SFM. We compared classic fine-tuning with two merging strategies: single-trajectory merging, which combines models from one fine-tuning path, and multi-run merging, which combines independently trained models. Our best multi-run merging approach achieved a 12% relative decrease in WER over classic fine-tuning, and a 16.2% relative decrease on long-form audio, a major loss contributor in dysarthric ASR. Merging more models led to continued gains, remained effective in low-data regimes, and generalized across model architectures. These results highlight model merging as an easily replicable adaptation…
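
The abstract does not state which merging rule was used, so the following is only a sketch under the assumption of uniform parameter averaging, one common way to merge models that share an architecture. Parameters are represented as plain lists of floats to keep the example dependency-free; a real Whisper checkpoint would hold tensors instead. The names `merge_uniform`, `run_a`, and `run_b` are hypothetical.

```python
from typing import Dict, List

# Assumption: a checkpoint maps parameter names to flat lists of floats.
StateDict = Dict[str, List[float]]


def merge_uniform(state_dicts: List[StateDict]) -> StateDict:
    """Average each parameter element-wise across all given checkpoints."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint to merge")
    reference = state_dicts[0]
    for sd in state_dicts[1:]:
        # All checkpoints must come from the same architecture.
        if sd.keys() != reference.keys():
            raise ValueError("checkpoints have different parameter names")
        for name in reference:
            if len(sd[name]) != len(reference[name]):
                raise ValueError(f"shape mismatch for parameter {name!r}")
    n = len(state_dicts)
    merged: StateDict = {}
    for name in reference:
        # zip(*...) groups the i-th element of this parameter from every run.
        columns = zip(*(sd[name] for sd in state_dicts))
        merged[name] = [sum(column) / n for column in columns]
    return merged


# Two toy "runs" standing in for independently fine-tuned models.
run_a = {"w": [1.0, 2.0], "b": [0.0]}
run_b = {"w": [3.0, 4.0], "b": [1.0]}
merged = merge_uniform([run_a, run_b])
print(merged)
```

Under this assumption, the same function would cover both settings described in the abstract: passing checkpoints saved along one fine-tuning run corresponds to single-trajectory merging, and passing the final models of independent runs corresponds to multi-run merging. Because the result is a single set of weights, inference cost matches that of one model.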

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Balanced Selection