Unifying Model and Layer Fusion for Speech Foundation Models

Yi-Jen Shih; David Harwath

arXiv:2511.08389·eess.AS·November 12, 2025

Open Access

TL;DR

This paper introduces a unified fusion interface for speech foundation models that combines multiple upstream models and their layers, and reports that it outperforms prior fusion approaches on speech tasks including ASR and paralinguistic analysis.

Contribution

It proposes an interface module that unifies model fusion and layer fusion, so that representations from multiple upstream speech models, and from multiple layers within each, can be combined for a downstream task.

Findings

01. Outperforms prior fusion methods on speech tasks
02. Scalable with model size and number of models
03. Performance depends on appropriate upstream model selection

Abstract

Speech Foundation Models have gained significant attention recently. Prior works have shown that the fusion of representations from multiple layers of the same model or the fusion of multiple models can improve performance on downstream tasks. We unify these two fusion strategies by proposing an interface module that enables fusion across multiple upstream speech models while integrating information across their layers. We conduct extensive experiments on different self-supervised and supervised models across various speech tasks, including ASR and paralinguistic analysis, and demonstrate that our method outperforms prior fusion approaches. We further analyze its scalability concerning model size and count, highlighting the importance of selecting appropriate upstream models. Our results show that the proposed interface provides an additional performance boost when given a suitable…
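To make the idea of fusing across both layers and models concrete, here is a minimal sketch. It is not the paper's interface module, whose design is not given in the abstract. It is a generic two-stage softmax-weighted sum: first over the layers of each model, then over the per-model summaries. The function names (`softmax`, `weighted_sum`, `fuse`) and the toy features are hypothetical, and the sketch assumes every model emits features of the same dimension for the same frame; in practice a projection and time alignment would likely be needed.

```python
import math


def softmax(scores):
    """Turn unnormalized scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(vectors, weights):
    """Element-wise weighted sum of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def fuse(models, layer_scores, model_scores):
    """Hypothetical two-stage fusion for a single frame.

    models[m][l] is the feature vector from layer l of model m.
    layer_scores[m] holds one (learnable, in practice) score per layer of model m.
    model_scores holds one score per model.
    """
    # Stage 1 (layer fusion): collapse each model's layers into one vector.
    per_model = [
        weighted_sum(layers, softmax(scores))
        for layers, scores in zip(models, layer_scores)
    ]
    # Stage 2 (model fusion): combine the per-model summaries.
    return weighted_sum(per_model, softmax(model_scores))


# Toy features: model A has 2 layers, model B has 3 layers, both 2-dimensional.
model_a = [[1.0, 0.0], [0.0, 1.0]]
model_b = [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]

# All-zero scores give uniform weights at both stages.
fused = fuse([model_a, model_b], [[0.0, 0.0], [0.0, 0.0, 0.0]], [0.0, 0.0])
```

In a trained system the scores would be parameters learned jointly with the downstream head; here they are fixed only to keep the example self-contained.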

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Topic Modeling