Unifying Model and Layer Fusion for Speech Foundation Models
Yi-Jen Shih, David Harwath

TL;DR
This paper introduces a unified fusion interface for speech foundation models that combines multiple upstream models and their layers, outperforming prior fusion approaches on speech tasks including ASR and paralinguistic analysis.
Contribution
It proposes a novel interface module that unifies model and layer fusion, enabling better integration of multiple speech models for enhanced task performance.
Findings
Outperforms prior fusion methods on speech tasks
Scalable with model size and number of models
Performance depends on appropriate upstream model selection
Abstract
Speech Foundation Models have gained significant attention recently. Prior works have shown that the fusion of representations from multiple layers of the same model or the fusion of multiple models can improve performance on downstream tasks. We unify these two fusion strategies by proposing an interface module that enables fusion across multiple upstream speech models while integrating information across their layers. We conduct extensive experiments on different self-supervised and supervised models across various speech tasks, including ASR and paralinguistic analysis, and demonstrate that our method outperforms prior fusion approaches. We further analyze its scalability concerning model size and count, highlighting the importance of selecting appropriate upstream models. Our results show that the proposed interface provides an additional performance boost when given a suitable…
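To make the two fusion strategies named in the abstract concrete, here is a minimal sketch of a common baseline: a softmax-weighted sum over the layers of each model, followed by concatenation across models. This is not the paper's interface module, whose internals are not described here. The function names, the tensor shapes, the zero-initialized logits, and the assumption that both models emit the same number of frames are all hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def fuse_layers(hidden_states, layer_logits):
    """Layer fusion: softmax-weighted sum over the layers of one model.

    hidden_states: array of shape (num_layers, num_frames, dim)
    layer_logits:  array of shape (num_layers,), learnable in practice
    returns:       array of shape (num_frames, dim)
    """
    weights = softmax(layer_logits)
    # Contract the layer axis of the weights with the layer axis of the states.
    return np.tensordot(weights, hidden_states, axes=1)


def fuse_models(per_model_states, per_model_logits):
    """Model fusion: concatenate each model's layer-fused features.

    Assumes every model produces the same number of frames (hypothetical;
    real upstream models may need resampling or alignment first).
    """
    fused = [
        fuse_layers(states, logits)
        for states, logits in zip(per_model_states, per_model_logits)
    ]
    return np.concatenate(fused, axis=-1)


# Random stand-ins for the hidden states of two upstream models.
rng = np.random.default_rng(0)
model_a = rng.normal(size=(13, 50, 768))   # 13 layers, 50 frames, 768 dims
model_b = rng.normal(size=(25, 50, 1024))  # 25 layers, 50 frames, 1024 dims

# Zero logits give uniform layer weights, i.e. a plain average over layers.
features = fuse_models([model_a, model_b], [np.zeros(13), np.zeros(25)])
print(features.shape)
```

In a trained system the layer logits would be parameters optimized together with the downstream head; they are fixed at zero here only to keep the sketch self-contained.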
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Topic Modeling
