Exploring the potential and limitations of Model Merging for Multi-Domain Adaptation in ASR
Carlos Carvalho, Francisco Teixeira, Thomas Rolland, Alberto Abad

TL;DR
This paper investigates model merging as a scalable approach to multi-domain automatic speech recognition (ASR): it benchmarks 11 merging algorithms, proposes a new method, BoostedTSV-M, and reports that it outperforms full fine-tuning on European Portuguese.
Contribution
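The study benchmarks 11 model merging algorithms across 10 European Portuguese domains, introduces BoostedTSV-M, a TSV-M-based algorithm that mitigates rank collapse via singular-value boosting and improves numerical stability, and shows that it outperforms full fine-tuning on European Portuguese.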
Findings
BoostedTSV-M mitigates rank collapse via singular-value boosting and improves numerical stability.
Model merging outperforms full fine-tuning on European Portuguese.
The approach maintains out-of-distribution generalization.
Abstract
Model merging is a scalable alternative to multi-task training that combines the capabilities of multiple specialised models into a single model. This is particularly attractive for large speech foundation models, which are typically adapted through domain-specific fine-tuning, resulting in multiple customised checkpoints, for which repeating full fine-tuning when new data becomes available is computationally prohibitive. In this work, we study model merging for multi-domain ASR and benchmark 11 merging algorithms for 10 European Portuguese domains, evaluating in-domain accuracy, robustness under distribution shift, as well as English and multilingual performance. We further propose BoostedTSV-M, a new merging algorithm based on TSV-M that mitigates rank collapse via singular-value boosting and improves numerical stability. Overall, our approach outperforms full fine-tuning on European…
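To make the idea of merging concrete at the parameter level, here is a minimal sketch of plain task-vector averaging (often called task arithmetic). It is not BoostedTSV-M or TSV-M, whose details are not given on this page. The function name `merge_task_arithmetic`, the scaling factor `alpha`, and the toy weights are assumptions chosen for illustration only.

```python
import numpy as np

def merge_task_arithmetic(base, finetuned, alpha=1.0):
    """Merge fine-tuned checkpoints by adding the mean task vector to the base.

    base:      dict mapping parameter name -> array (the pretrained weights)
    finetuned: list of dicts with the same keys (one per domain checkpoint)
    alpha:     hypothetical scaling factor applied to the averaged update
    """
    merged = {}
    for name, w0 in base.items():
        # Task vector: what each domain's fine-tuning changed for this parameter.
        deltas = [ckpt[name] - w0 for ckpt in finetuned]
        merged[name] = w0 + alpha * np.mean(deltas, axis=0)
    return merged

# Toy example: two "domain" checkpoints that each modify a different entry.
base = {"w": np.zeros((2, 2))}
domain_a = {"w": np.array([[1.0, 0.0], [0.0, 0.0]])}
domain_b = {"w": np.array([[0.0, 0.0], [0.0, 3.0]])}
merged = merge_task_arithmetic(base, [domain_a, domain_b])
```

Because the merge only needs the stored checkpoints, a new domain can in principle be folded in by recomputing the average rather than retraining on all the data, which is the scalability argument the abstract makes.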
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis · Topic Modeling
