Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts

Hojun Jin; Eunsoo Hong; Ziwon Hyung; Sungjun Lim; Seungjin Lee; Keunseok Cho

arXiv:2508.10009·cs.CL·August 15, 2025

TL;DR

This paper introduces Supervised Mixture of Experts (S-MoE), a multi-task speech-to-text approach that routes each task to a dedicated expert using special guiding tokens instead of a trained gating function, reducing the task interference caused by hard-parameter sharing.

Contribution

The paper proposes S-MoE, which assigns each task to its own feedforward network (expert). This overcomes the limitations of hard-parameter sharing when a single model jointly performs automatic speech recognition (ASR) and speech translation (ST).

Findings

01. Achieves a 6.35% relative improvement in Word Error Rate (WER).

02. Processes mixed-bandwidth speech input within a single model.

03. Outperforms hard-parameter-sharing models.
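
For reference, a relative WER improvement is conventionally measured against a baseline system. The definition below is the standard one, not a formula quoted from the paper; the subscripts are labels introduced here, and this summary does not state which baseline the 6.35% figure is measured against.

```latex
\[
\text{relative WER improvement}
  = \frac{\mathrm{WER}_{\text{baseline}} - \mathrm{WER}_{\text{S-MoE}}}
         {\mathrm{WER}_{\text{baseline}}} \times 100\%
\]
```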

Abstract

Hard-parameter sharing is a common strategy to train a single model jointly across diverse tasks. However, this often leads to task interference, impeding overall model performance. To address the issue, we propose a simple yet effective Supervised Mixture of Experts (S-MoE). Unlike traditional Mixture of Experts models, S-MoE eliminates the need for training gating functions by utilizing special guiding tokens to route each task to its designated expert. By assigning each task to a separate feedforward network, S-MoE overcomes the limitations of hard-parameter sharing. We further apply S-MoE to a speech-to-text model, enabling the model to process mixed-bandwidth input while jointly performing automatic speech recognition (ASR) and speech translation (ST). Experimental results demonstrate the effectiveness of the proposed S-MoE, achieving a 6.35% relative improvement in Word Error Rate…
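
The routing idea in the abstract can be illustrated with a minimal sketch: a guiding token selects an expert feedforward network through a fixed lookup, so no gating function is trained. This is not the authors' implementation. The token names (`<asr>`, `<st>`), the class `SupervisedMoELayer`, the layer sizes, and the random weights are all assumptions made for illustration.

```python
import random

# Hypothetical guiding tokens; the paper's actual token inventory is not given here.
TASK_TO_EXPERT = {"<asr>": 0, "<st>": 1}


def make_ffn(d_model, d_hidden, rng):
    """Create random weights for one expert: a two-layer feedforward network."""
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    return w1, w2


def matvec(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def ffn_forward(x, ffn):
    """Apply linear -> ReLU -> linear to a single feature vector."""
    w1, w2 = ffn
    hidden = [max(0.0, h) for h in matvec(x, w1)]
    return matvec(hidden, w2)


class SupervisedMoELayer:
    """Illustrative layer with one feedforward expert per task (assumed design)."""

    def __init__(self, d_model=4, d_hidden=8, num_experts=2, seed=0):
        rng = random.Random(seed)
        self.experts = [make_ffn(d_model, d_hidden, rng) for _ in range(num_experts)]

    def forward(self, frames, guiding_token):
        # Supervised routing: a fixed lookup replaces the learned gate.
        expert_id = TASK_TO_EXPERT[guiding_token]
        expert = self.experts[expert_id]
        return [ffn_forward(x, expert) for x in frames]


layer = SupervisedMoELayer()
frames = [[0.5, -1.0, 0.25, 2.0], [1.0, 0.0, -0.5, 0.3]]  # toy feature vectors
asr_out = layer.forward(frames, "<asr>")  # handled by expert 0
st_out = layer.forward(frames, "<st>")    # handled by expert 1
```

Because the expert is chosen by the task label rather than by a learned router, every input for a given task passes through the same parameters, and the experts for other tasks are untouched by it. An unknown token raises a `KeyError` in this sketch instead of falling back to a default expert.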
